Abstract. Making use of predictions is a crucial, but under-explored, area of
online algorithms. This paper studies a class of online optimization problems
where we have external noisy predictions available. We propose a stochastic
prediction error model that generalizes prior models in the learning and
stochastic control communities, incorporates correlation among prediction
errors, and captures the fact that predictions improve as time passes. We prove
that achieving sublinear regret and constant competitive ratio for online
algorithms requires the use of an unbounded prediction window in adversarial
settings, but that under more realistic stochastic prediction error models it is
possible to use Averaging Fixed Horizon Control (AFHC) to simultaneously
achieve sublinear regret and constant competitive ratio in expectation using
only a constant-sized prediction window. Furthermore, we show that the
performance of AFHC is tightly concentrated around its mean.


1. Introduction 

Making use of predictions about the future is a crucial, but under-explored, area 
of online algorithms. In this paper, we use online convex optimization to illustrate 
the insights that can be gained from incorporating a general, realistic model of 
prediction noise into the analysis of online algorithms. 

Online convex optimization. In an online convex optimization (OCO) problem,
a learner interacts with an environment in a sequence of rounds. In round $t$
the learner chooses an action $x_t$ from a convex decision/action space $G$, and then
the environment reveals a convex cost function $c_t$ and the learner pays cost $c_t(x_t)$.
An algorithm’s goal is to minimize total cost over a (long) horizon $T$.

OCO has a rich theory and a wide range of important applications. In computer
science, it is most associated with the so-called $k$-experts problem, an online
learning problem where in each round $t$ the algorithm chooses one of $k$ possible actions,
viewed as following the advice of one of $k$ “experts”.

However, OCO is being applied increasingly broadly and has recently become
prominent in networking and cloud computing applications, including the design
of dynamic capacity planning, load shifting and demand response for data centers
[Za [291 EqI EH ES], geographical load balancing of internet-scale systems [2H1 HZ],
electric vehicle charging [HIES], video streaming IZHEZ] and thermal
management of systems-on-chip iiaiiTj.

In typical applications of online convex optimization in networking and cloud
computing there is an additional cost in each round, termed a “switching cost”,
that captures the cost of changing actions during a round. Specifically, the cost is
$c_t(x_t) + \|x_t - x_{t-1}\|$, where $\|\cdot\|$ is a norm (often the one-norm). This additional
term makes the online problem more challenging since the optimal choice in a round
then depends on future cost functions. These “smoothed” online convex
optimization problems have received considerable attention in the context of networking
and cloud computing applications, e.g., [551 HH 13011311131], and are also relevant
for many more traditional online convex optimization applications where, in reality,
there is a cost associated with a change in action, e.g., portfolio management. We
focus on smoothed online convex optimization problems.

A mismatch between theory and practice. As OCO algorithms make 
their way into networking and cloud computing applications, it is increasingly clear 
that there is a mismatch between the pessimistic results provided by the theoretical 
analysis (which is typically adversarial) and the near-optimal performance observed 
in practice. 

Concretely, two main performance metrics have been studied in the literature: 
regret, defined as the difference between the cost of the algorithm and the cost of the 
offline optimal static solution, and the competitive ratio, defined as the maximum 
ratio between the cost of the algorithm and the cost of the offline optimal (dynamic) 
solution. 

Within the machine learning community, regret has been heavily studied [20, 45, 49]
and there are many simple algorithms that provide provably sublinear regret
(also called “no regret”). For example, online gradient descent achieves
$O(\sqrt{T})$-regret [49], even when there are switching costs [1]. In contrast, the online
algorithms community considers a more general class of problems called “metrical task
systems” (MTS) and focuses on competitive ratio (T] H] [29]. Most results in this
literature are “negative”, e.g., when $c_t$ are arbitrary, the competitive ratio grows
without bound as the number of states in the decision space grows [8]. Exceptions
to such negative results come only when structure is imposed on either the cost
functions or the decision space, e.g., when the decision space is one-dimensional it
is possible for an online algorithm to have a constant competitive ratio, e.g., [29].
However, even in this simple setting no algorithm performs well for both
competitive ratio and regret. No online algorithm can have sublinear regret and a constant
competitive ratio, even if the decision space is one-dimensional and cost functions
are linear [4].

In contrast to the pessimism of the analytic work, applications in networking and
cloud computing have shown that OCO algorithms can significantly outperform the
static optimum while nearly matching the performance of the dynamic optimum,
i.e., simultaneously do well for regret and the competitive ratio. Examples include
dynamic capacity management of a single data center nng and geographical load
balancing across multiple data centers [551 1331133].

It is tempting to attribute this discrepancy to the fact that practical workloads
are not adversarial. However, a more important contrast between the theory and
application is simply that, in reality, relatively accurate predictions about the
future, such as diurnal variations laiiiiET], are available, and thus play a crucial
role in the algorithms.

Incorporating predictions. It is no surprise that predictions are crucial to 
online algorithms in practice. In OCO, knowledge about future cost functions is 
valuable, even when noisy. However, despite the importance of predictions, we do 
not understand how prediction noise affects the performance (and design) of online 
convex optimization algorithms. 
This is not due to a lack of effort. Most papers that apply OCO algorithms to
networking and cloud computing applications study the impact of prediction noise, 
e.g., [IJI1I3I1I35]. Typically, these consider numerical simulations where i.i.d. noise 
terms with different levels of variability are added to the parameter being predicted, 
e.g., [milO]. While this is a valuable first step, it does not provide any guarantees 
about the performance of the algorithm with realistic prediction errors (which tend 
to be correlated, since an overestimate in one period is likely followed by another 
overestimate) and further does not help inform the design of algorithms that can 
effectively use predictions. 

Though most work on predictions has been simulation based, there has also been 
significant work done seeking analytic guarantees. This literature can be categorized 
into: 

(i) Worst-case models of prediction error typically assume that there exists a
lookahead window $w$ such that within that window, prediction is near-perfect
(too optimistic), and outside that window the workload is adversarial (too
pessimistic), e.g., [8l [29l [2^ IMl 112).

(ii) Simple stochastic models of prediction error typically consider i.i.d. errors,
e.g., [iniisiiisn]. Although this is analytically appealing, it ignores important
features of prediction errors, as described in the next section.

(iii) Detailed stochastic models of specific predictors applied for specific signal
models, such as [IHl im [39l [23]. This leads to less pessimistic results, but the
guarantees, and the algorithms themselves, become too fragile to assumptions
on the system evolution.

Contributions of this paper. First, we introduce a general colored noise 
model for studying prediction errors in online convex optimization problems. The 
model captures three important features of real predictors: (i) it allows for arbitrary 
correlations in prediction errors (e.g., both short and long range correlations); (ii) 
the quality of predictions decreases the further in the future we try to look ahead; 
and (iii) predictions about the future are updated as time passes. Further, it strikes 
a middle ground between the worst-case and stochastic approaches. In particular, it 
does not make any assumptions about an underlying stochastic process or the design 
of the predictor. Instead, it only makes (weak) assumptions about the stochastic 
form of the error of the predictor; these assumptions are satisfied by many common 
stochastic models, e.g., the prediction error of standard Wiener filters [44] and
Kalman filters [24]. Importantly, by being agnostic to the underlying stochastic 
process, the model allows worst-case analysis with respect to the realization of the 
underlying cost functions. 

Second, using this model, we show that a simple algorithm, Averaging Fixed
Horizon Control (AFHC) [28], simultaneously achieves sublinear regret and a
constant competitive ratio in expectation using very limited prediction, i.e., a prediction
window of size $O(1)$, in nearly all situations when it is feasible for an online
algorithm to do so (Theorem 9). Further, we show that the performance of AFHC
is tightly concentrated around its mean (Theorem 10). Thus, AFHC extracts the
asymptotically optimal value from predictions. Additionally, our results inform the
choice of the optimal prediction window size. (For ease of presentation, both
Theorems 9 and 10 are stated and proven for the specific case of online LASSO, see
Section 2, but the proof technique can be generalized in a straightforward way.)
Importantly, Theorem 5 highlights that the dominant factor impacting whether
the prediction window should be long or short in AFHC is not the variance of the
noise, but rather the correlation structure of the noise. For example, if prediction
errors are i.i.d. then it is optimal for AFHC to look ahead as far as possible (i.e.,
$T$) regardless of the variance, but if prediction errors have strong short-range
dependencies then the optimal prediction window is constant sized regardless of the
variance.

Previously, AFHC had only been analyzed in the adversarial model [55], and our
results are in stark contrast to the pessimism of prior work. To highlight this, we
prove that in the “easiest” adversarial model (where predictions are exact within the
prediction window), no online algorithm can achieve sublinear regret and a constant
competitive ratio when using a prediction window of constant size (Theorem 1).
This contrast emphasizes the value of moving to a more realistic stochastic model 
of prediction error. 


2. Online Convex Optimization with Switching Costs


Throughout this paper we consider online convex optimization problems with 
switching costs, i.e., “smoothed” online convex optimization (SOCO) problems. 

2.1. Problem Formulation. The standard formulation of an online optimization
problem with switching costs considers a convex decision/action space $G \subset \mathbb{R}^n$ and
a sequence of cost functions $\{c_1, c_2, \ldots\}$, where each $c_t : G \to \mathbb{R}^+$ is convex. At
time $t$, the online algorithm first chooses an action, which is a vector $x_t \in G$, the
environment chooses a cost function $c_t$ from a set $\mathcal{C}$, and the algorithm pays a stage
cost $c_t(x_t)$ and a switching cost $\beta\|x_t - x_{t-1}\|$, where $\beta \in \mathbb{R}^+$. Thus, the total cost
of the online algorithm is defined to be

(1) $\mathrm{cost}(ALG) = \mathbb{E}_x\left[\sum_{t=1}^{T} c_t(x_t) + \beta\|x_t - x_{t-1}\|\right]$,

where $x_1, \ldots, x_T$ are the actions chosen by the algorithm, ALG. Without loss of
generality, assume the initial action $x_0 = 0$; the expectation is over any randomness
used by the algorithm, and $\|\cdot\|$ is a seminorm on $\mathbb{R}^n$.

Typically, a number of assumptions about the action space $G$ and the cost
functions $c_t$ are made to allow positive results to be derived. In particular, the action
set $G$ is often assumed to be closed, nonempty, and bounded, where by bounded
we mean that there exists $D \in \mathbb{R}$ such that for all $x, y \in G$, $\|x - y\| \le D$. Further,
the cost functions $c_t$ are assumed to have a uniformly bounded subgradient, i.e.,
there exists $N \in \mathbb{R}^+$ such that, for all $x \in G$, $\|\nabla c_t(x)\| \le N$.

Since our focus in this paper is on predictions, we consider a variation of the
above with parameterized cost functions $c_t(x_t; y_t)$, where the parameter $y_t$ is the
focus of prediction. Further, except when considering worst-case predictions, we
adopt a specific form of $c_t$ for concreteness. We focus on a tracking problem where
the online algorithm is trying to do a “smooth” tracking of $y_t$ and pays a
least-squares penalty each round:

(2) $\mathrm{cost}(ALG) = \mathbb{E}_x\left[\sum_{t=1}^{T} \frac{1}{2}\|y_t - K x_t\|_2^2 + \beta\|x_t - x_{t-1}\|_1\right]$,

where the target $y_t \in \mathbb{R}^m$, and $K \in \mathbb{R}^{m \times n}$ is a (known) linear map that transforms
the control variable into the space of the tracking target. Let $K^\dagger$ be the
Moore-Penrose pseudoinverse of $K$.

We focus on this form because it represents an online version of the LASSO (Least 
Absolute Shrinkage and Selection Operator) formulation, which is widely studied in 
a variety of contexts, e.g., see [miiiiiiiiiiT] and the references therein. Typically 
in LASSO the one-norm regularizer is used to induce sparsity in the solution. In 
our case, this corresponds to specifying that a good solution does not change too 
much, i.e., $x_t - x_{t-1} \neq 0$ is infrequent. Importantly, the focus on LASSO, i.e., the
two-norm loss function and one-norm regularizer, is simply for concreteness and 
ease of presentation. Our proof technique generalizes (at the expense of length and 
complexity). 
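
As a concrete illustration of the objective described above, the following minimal Python sketch evaluates the total cost of a given sequence of actions. It assumes the squared two-norm tracking loss with a factor of 1/2 and a one-norm switching cost weighted by $\beta$, with $x_0 = 0$; the function name `soco_cost` and the list-of-rows representation of $K$ are illustrative choices, not notation from the paper.

```python
def soco_cost(xs, ys, K, beta):
    """Total cost of actions xs against targets ys (illustrative helper):
    sum_t 0.5 * ||y_t - K x_t||_2^2 + beta * ||x_t - x_{t-1}||_1, with x_0 = 0.

    xs: list of T action vectors (each a list of n floats)
    ys: list of T target vectors (each a list of m floats)
    K:  m-by-n matrix given as a list of m rows
    """
    n = len(K[0])
    prev = [0.0] * n  # x_0 = 0 by convention
    total = 0.0
    for x, y in zip(xs, ys):
        # K x_t, computed row by row
        Kx = [sum(k_ij * x_j for k_ij, x_j in zip(row, x)) for row in K]
        tracking = 0.5 * sum((y_i - kx_i) ** 2 for y_i, kx_i in zip(y, Kx))
        switching = beta * sum(abs(a - b) for a, b in zip(x, prev))
        total += tracking + switching
        prev = x
    return total


# Scalar example with K = 1 and beta = 2: tracking the target exactly pays only
# the switching cost, while never moving pays only the tracking cost.
K = [[1.0]]
ys = [[1.0], [1.0]]
cost_track = soco_cost([[1.0], [1.0]], ys, K, beta=2.0)
cost_stay = soco_cost([[0.0], [0.0]], ys, K, beta=2.0)
```

In this small instance staying at zero is cheaper than tracking, which illustrates why the best action in a round depends on how long the target is expected to persist.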

We assume that $K^T K$ is invertible and that the static optimal solution to (2)
is positive. Neither of these is particularly restrictive. If $K$ has full column rank
then $K^T K$ is invertible. This is reasonable, for example, when the dimensionality
of the action space $G$ is small relative to the output space. Note that typically
$K$ is designed, and so it can be chosen to ensure these assumptions are satisfied.
Additionally, if $K$ is invertible, then it no longer appears in the results provided.

Finally, it is important to highlight a few contrasts between the cost function in
(2) and the typical assumptions in the online convex optimization literature. First,
note that the feasible action set $G = \mathbb{R}^n$ is unbounded. Second, note that the gradient
of $c_t$ can be arbitrarily large when $y_t$ and $K x_t$ are far apart. Thus, both of these
assumptions are relaxed compared to what is typically studied in the online convex
optimization literature. We show in Section 5 that we can have sublinear regret even in this
relaxed setting.

2.2. Performance Metrics. The performance of online algorithms for SOCO
problems is typically evaluated via two performance metrics: regret and the
competitive ratio. Regret is the dominant choice in the machine learning community and
competitive ratio is the dominant choice in the online algorithms community. The
key difference between these measures is whether they compare the performance
of the online algorithm to the offline optimal static solution or the offline optimal
dynamic solution. Specifically, the optimal offline static solution is^1

(3) $STA = \arg\min_{x \in G} \sum_{t=1}^{T} c_t(x) + \beta\|x\|$,

and the optimal dynamic solution is

(4) $OPT = \arg\min_{(x_1, \ldots, x_T) \in G^T} \sum_{t=1}^{T} c_t(x_t) + \beta\|x_t - x_{t-1}\|$.

Definition 1. The regret of an online algorithm, ALG, is at most $\rho(T)$ if the
following holds:

(5) $\sup_{(c_1, \ldots, c_T) \in \mathcal{C}^T} \mathrm{cost}(ALG) - \mathrm{cost}(STA) \le \rho(T)$.


^1 One switching cost is incurred due to the fact that we enforce $x_0 = 0$.
Definition 2. An online algorithm ALG is said to be $\rho(T)$-competitive if the
following holds:

(6) $\sup_{(c_1, \ldots, c_T) \in \mathcal{C}^T} \frac{\mathrm{cost}(ALG)}{\mathrm{cost}(OPT)} \le \rho(T)$.

The goals are typically to find algorithms with a (small) constant competitive
ratio (“constant-competitive”) and to find online algorithms with sublinear regret,
i.e., an algorithm ALG whose regret is bounded above by some $\rho(T) \in o(T)$;
note that $\rho(T)$ may be negative if the concept we seek to learn varies dynamically.
Sublinear regret is also called “no-regret”, since the time-averaged regret of the online
algorithm is at most $\rho(T)/T$, which goes to zero as $T$ grows.
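
To make the static benchmark concrete, the sketch below computes the regret of one simple online policy on a single scalar instance. It assumes the scalar tracking cost $c_t(x) = \frac{1}{2}(y_t - x)^2$ with $K = 1$, for which the static optimum has the closed form of a soft-thresholded mean; all function names are hypothetical. Note that the definition of regret takes a supremum over instances, whereas this example evaluates only one.

```python
def soft_threshold(z, lam):
    """Shrink z toward zero by lam (the proximal operator of lam * |x|)."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def static_optimum(ys, beta):
    """Minimizer over scalar x of sum_t 0.5 * (y_t - x)^2 + beta * |x|."""
    return soft_threshold(sum(ys), beta) / len(ys)


def static_cost(x, ys, beta):
    """Cost of playing the fixed action x in every round, starting from x_0 = 0."""
    return sum(0.5 * (y - x) ** 2 for y in ys) + beta * abs(x)


def online_cost(xs, ys, beta):
    """Cost of the action sequence xs, including switching costs from x_0 = 0."""
    prev, total = 0.0, 0.0
    for x, y in zip(xs, ys):
        total += 0.5 * (y - x) ** 2 + beta * abs(x - prev)
        prev = x
    return total


ys = [1.0, 3.0, 1.0, 3.0]
beta = 1.0
x_sta = static_optimum(ys, beta)   # soft_threshold(8, 1) / 4 = 1.75
greedy = ys                        # a policy that tracks the target exactly
regret = online_cost(greedy, ys, beta) - static_cost(x_sta, ys, beta)
```

Here the greedy tracker pays 7 in switching costs while the static optimum pays 3.875, so chasing an oscillating target exactly incurs positive regret on this instance.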


2.3. Background. There are large literatures studying both the
design of no-regret algorithms and the design of constant-competitive algorithms.
However, in general, these results tell a pessimistic story.

In particular, on a positive note, it is possible to design simple, no-regret
algorithms, e.g., online gradient descent (OGD) based algorithms [49, 20] and Online
Newton Step and Follow the Approximate Leader algorithms [20]. (Note that the
classical setting does not consider switching costs; however, [4] shows that similar
regret bounds can be obtained when switching costs are considered.)
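
For readers unfamiliar with OGD, a minimal sketch follows. It assumes scalar quadratic costs $c_t(x) = \frac{1}{2}(x - y_t)^2$ on the interval $[-1, 1]$ and the commonly used step size $\eta_t = \eta_0/\sqrt{t}$; these choices, the function names, and the starting point are illustrative assumptions, and the sketch ignores switching costs.

```python
import math


def project(x, lo, hi):
    """Euclidean projection of a scalar onto the interval [lo, hi]."""
    return max(lo, min(hi, x))


def online_gradient_descent(ys, lo=-1.0, hi=1.0, eta0=0.5):
    """Play x_t, then observe c_t(x) = 0.5 * (x - y_t)^2 and take a projected
    gradient step with step size eta0 / sqrt(t).

    Returns the list of actions x_1, ..., x_T (x_1 = 0 is an arbitrary start).
    """
    x, actions = 0.0, []
    for t, y in enumerate(ys, start=1):
        actions.append(x)          # commit to x_t before c_t is revealed
        grad = x - y               # gradient of 0.5 * (x - y)^2 at x_t
        x = project(x - (eta0 / math.sqrt(t)) * grad, lo, hi)
    return actions


# With a constant target the iterates approach the static optimum.
actions = online_gradient_descent([0.5] * 200)
```

The key design point is that the action for round $t$ is fixed before $c_t$ is seen; only past gradients are used.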

However, when one considers the competitive ratio, results are much less
optimistic. Historically, results about competitive ratio have considered weaker
assumptions, i.e., the cost functions $c_t$ and the action set $G$ can be nonconvex, and the
switching cost is an arbitrary metric $d(x_t, x_{t-1})$ rather than a seminorm $\|x_t - x_{t-1}\|$.
The weakened assumptions, together with the tougher offline target for
comparison, lead to the fact that most results are “negative”. For example, [5] has shown
that any deterministic algorithm must be $\Omega(n)$-competitive given a metric decision
space of size $n$. Furthermore, [7] has shown that any randomized algorithm must
be $\Omega(\sqrt{\log n / \log\log n})$-competitive. To this point, positive results are only known
in very special cases. For example, [29] shows that, when $G$ is a one-dimensional
normed space, there exists a deterministic online algorithm that is 3-competitive.

Results become even more pessimistic when one asks for algorithms that
perform well for both competitive ratio and regret. Note that performing well for
both measures is particularly desirable for many networking and cloud computing
applications where it is necessary to both argue that a dynamic control algorithm
provides benefits over a static control algorithm (sublinear regret) and is near
optimal (competitive ratio). However, a recent result in [1] highlights that such a goal
is impossible: even when the setting is restricted to a one-dimensional normed space
with linear cost functions, no online algorithm can simultaneously achieve sublinear
regret and constant competitive ratio.^2


3. Modeling Prediction Error 

The adversarial model underlying most prior work on online convex optimization 
has led to results that tend to be pessimistic; however, in reality, algorithms can 
often use predictions about future cost functions in order to perform well. 


^2 Note that this impossibility is not the result of the regret being additive and the competitive ratio
being multiplicative, since the parallel result also holds for competitive difference, which is an additive
comparison to the dynamic optimal.
Knowing information about future cost functions is clearly valuable for smoothed
online convex optimization problems, since it allows an algorithm to better judge whether
it is worth incurring a switching cost during the current stage. Thus, it is not
surprising that predictions have proven valuable in practice for such problems.

Given the value of predictions in practice, it is not surprising that there have been 
numerous attempts to incorporate models of prediction error into the analysis of 
online algorithms. We briefly expand upon the worst-case and stochastic approaches 
described in the introduction to motivate our approach, which is an integration of 
the two. 

Worst-case models. Worst-case models of prediction error tend to assume 
that there exists a lookahead window, w, such that within that window, a perfect 
(or near-perfect, e.g., error bounded by $\epsilon$) prediction is available. Then, outside of
that window the workload is adversarial. A specific example is that, for any $t$ the
online algorithm knows $y_t, \ldots, y_{t+w}$ precisely, while $y_{t+w+1}, \ldots$ are adversarial.

Clearly, such models are both too optimistic about the predictions used
and too pessimistic about what is outside the prediction window. The result is 
that algorithms designed using such models tend to be too trusting of short term 
predictions and too wary of unknown fluctuations outside of the prediction window. 
Further, such models tend to underestimate the value of predictions for algorithm 
performance. To illustrate this, we establish the following theorem. 

Theorem 1. For any constant $\gamma > 0$ and any online algorithm $A$ (deterministic or
randomized) with constant lookahead $w$, either the competitive ratio of the algorithm
is at least $\gamma$ or its regret is $\Omega(T)$. Here $T$ is the number of cost functions in an
instance.

The above theorem focuses on the “easiest” worst-case model, i.e., where the 
algorithm is allowed perfect lookahead for w steps. Even in this case, an online 
algorithm must have super-constant lookahead in order to simultaneously have 
sublinear regret and a constant competitive ratio. Further, the proof (given in
Appendix A.1) highlights that this holds even in the scalar setting with linear
cost functions. Thus, worst-case models are overly pessimistic about the value of 
prediction. 

Stochastic models. Stochastic models tend to come in two forms: (i) i.i.d.
models or (ii) detailed models of stochastic processes and specific predictors for
those processes.

In the first case, for reasons of tractability, prediction errors are simply assumed
to be i.i.d. mean zero random variables. While such an assumption is clearly
analytically appealing, it is also quite simplistic and ignores many important features
of prediction errors. For example, in reality, predictions have increasing error the
further in time we look ahead due to correlation of prediction errors in nearby
points in time. Further, predictions tend to be updated or refined as time passes.
These fundamental characteristics of predictions cannot be captured by the i.i.d. 
model. 

In the second case, which is common in control theory, a specific stochastic 
model for the underlying process is assumed and then an optimal predictor (filter) 
is derived. Examples here include the derivation of Wiener filters and Kalman
filters for the prediction of wide-sense stationary processes and linear dynamical
systems respectively, see [23]. While such approaches avoid the pessimism of the
worst-case viewpoint, they instead tend to be fragile to the underlying modeling
assumptions. In particular, an online algorithm designed to use a particular filter
based on a particular stochastic model lacks the robustness to be used in settings 
where the underlying assumptions are not valid. 

3.1. A General Prediction Model. A key contribution of this paper is the
development of a model for studying predictions that provides a middle ground between
the worst-case and the stochastic viewpoints. The model we propose below seeks a 
middle ground by not making any assumption on the underlying stochastic process 
or the design of the predictor, but instead making assumptions only on the form of 
the error of the predictor. Thus, it is agnostic to the predictor and can be used in 
worst-case analysis with respect to the realization of the underlying cost functions. 

Further, the model captures three important features of real predictors: (i) it 
allows for correlations in prediction errors (both short range and long range); (ii) 
the quality of predictions decreases the further in the future we try to look ahead; 
and (iii) predictions about the future are refined as time passes. 

Concretely, throughout this paper we model prediction error via the following
equation:

(7) $y_t = y_{t|\tau} + \sum_{s=\tau+1}^{t} f(t-s)\, e(s)$.

Here, $y_{t|\tau}$ is the prediction of $y_t$ made at time $\tau < t$. Thus, $y_t - y_{t|\tau}$ is the prediction
error, and is specified by the summation in (7). In particular, the prediction error
is modeled as a weighted linear combination of per-step noise terms, $e(s)$, with
weights $f(t-s)$ for some deterministic impulse function $f$. The key assumptions
of the model are that $e(s)$ are i.i.d. with mean zero and positive definite covariance
$R_e$, and that $f$ satisfies $f(0) = I$ and $f(t) = 0$ for $t < 0$. Note that, as the examples
below illustrate, it is common for the impulse function to decay as $f(s) \sim 1/s^\alpha$.
As we will see later, this simple model is flexible enough to capture the prediction
errors that arise from classical filters on time series, and it can represent all forms
of stationary prediction error by using appropriate forms of $f$.

Some intuition for the form of the model can be obtained by expanding the
summation in (7). In particular, note that for $\tau = t - 1$ we have

(8) $y_t - y_{t|t-1} = f(0)\, e(t) = e(t)$,

which highlights why we refer to $e(t)$ as the per-step noise.

Expanding the summation further gives

(9) $y_t - y_{t|\tau} = f(0)\, e(t) + f(1)\, e(t-1) + \ldots + f(t-\tau-1)\, e(\tau+1)$.

Note that the first term is the one-step prediction error $y_t - y_{t|t-1}$; the first two
terms make up the two-step prediction error $y_t - y_{t|t-2}$; and so on. This highlights
that predictions in the model have increasing noise as one looks further ahead in
time and that predictions are refined as time goes forward.
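
The following minimal Python sketch simulates the scalar version of this prediction model and checks the two properties just described. The particular impulse function $f(s) = 1/(s+1)^\alpha$ (chosen so that $f(0) = 1$), the Gaussian noise, and all function names are assumptions made for illustration only.

```python
import random


def impulse(s, alpha=1.0):
    """Hypothetical impulse function: f(s) = 1/(s+1)^alpha for s >= 0, else 0."""
    if s < 0:
        return 0.0
    return 1.0 / (s + 1) ** alpha


def realize(y_hat, noise, f=impulse):
    """y_t = y_hat_t + sum_{s=1}^{t} f(t - s) e(s) for t = 1..T.

    Lists are 1-indexed; index 0 is unused padding.
    """
    T = len(y_hat) - 1
    return [None] + [
        y_hat[t] + sum(f(t - s) * noise[s] for s in range(1, t + 1))
        for t in range(1, T + 1)
    ]


def predict(y, noise, t, tau, f=impulse):
    """Prediction y_{t|tau}: y_t minus the noise terms e(tau+1), ..., e(t)
    that have not yet been revealed at time tau."""
    return y[t] - sum(f(t - s) * noise[s] for s in range(tau + 1, t + 1))


random.seed(0)
T = 50
y_hat = [0.0] * (T + 1)                                   # initial predictions
noise = [0.0] + [random.gauss(0.0, 1.0) for _ in range(T)]
y = realize(y_hat, noise)

one_step_error = y[10] - predict(y, noise, 10, 9)         # equals e(10)
refinement = predict(y, noise, 10, 5) - predict(y, noise, 10, 4)  # f(5) * e(5)
```

Each time step reveals one more noise term, so the prediction of a fixed future target is corrected by exactly $f(t - \tau)\, e(\tau)$ when moving from time $\tau - 1$ to time $\tau$.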

Additionally, note that the form of (9) highlights that the impulse function $f$
captures the degree of short-term/long-term correlation in prediction errors.
Specifically, the form of $f(t)$ determines how important the error $t$ steps in the past is
for the prediction. Since we assume no structural form for $f$, complex correlation
structures are possible.

Naturally, the form of the correlation structure plays a crucial role in the
performance results we prove. But, the detailed structure is not important, only its
effect on the aggregate variance. Specifically, the impact of the correlation
structure on performance is captured through the following two definitions, which play
a prominent role in our analysis. First, for any $w > 0$, let $\|f_w\|^2$ be the two norm
of prediction error covariance over $(w + 1)$ steps of prediction, i.e.,

(10) $\|f_w\|^2 = \mathrm{tr}(\mathbb{E}[\delta y_w \delta y_w^T]) = \mathrm{tr}\left(R_e \sum_{s=0}^{w} f(s)^T f(s)\right)$,

where $\delta y_w = y_{t+w} - y_{t+w|t-1} = \sum_{s=t}^{t+w} f(t+w-s)\, e(s)$. The derivation of (10) is
found in the proof of Theorem 5.

Second, let $F(w)$ be the two norm square of the projected cumulative prediction
error covariance, i.e.,

(11) $F(w) = \sum_{t=0}^{w} \mathbb{E}\|K K^\dagger \delta y_t\|^2 = \mathrm{tr}\left(R_e \sum_{s=0}^{w} (w - s + 1) f(s)^T K K^\dagger f(s)\right)$.

Note that $K K^\dagger$ is the orthogonal projector onto the range space of $K$. Hence it
is natural that the definitions are over the induced norm of $K K^\dagger$, since any action
chosen from the space $G$ can only be mapped to the range space of $K$, i.e., no
algorithm, online or offline, can track the portion of $y$ that falls in the orthogonal
complement of the range space of $K$.
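
To see how the correlation structure enters these two quantities, the sketch below evaluates them in the scalar case, assuming $K K^\dagger = 1$ and noise variance `sigma2` in place of $R_e$. The two impulse functions and all names are illustrative assumptions.

```python
def f_w_norm_sq(f, w, sigma2=1.0):
    """Scalar version of the first quantity: sigma2 * sum_{s=0}^{w} f(s)^2."""
    return sigma2 * sum(f(s) ** 2 for s in range(w + 1))


def cumulative_F(f, w, sigma2=1.0):
    """Scalar version of F(w) with K K^dagger = 1:
    sigma2 * sum_{s=0}^{w} (w - s + 1) * f(s)^2."""
    return sigma2 * sum((w - s + 1) * f(s) ** 2 for s in range(w + 1))


def uncorrelated(s):
    """f(0) = 1 and f(s) = 0 otherwise: per-step errors do not accumulate."""
    return 1.0 if s == 0 else 0.0


def long_range(s):
    """f(s) = 1 for all s >= 0: every past noise term persists."""
    return 1.0 if s >= 0 else 0.0


w = 4
short = (f_w_norm_sq(uncorrelated, w), cumulative_F(uncorrelated, w))  # (1.0, 5.0)
long_ = (f_w_norm_sq(long_range, w), cumulative_F(long_range, w))      # (5.0, 15.0)
```

With the same per-step variance, $F(w)$ grows linearly in $w$ when errors do not accumulate and quadratically when they persist, which is the sense in which the correlation structure, rather than the variance, drives the choice of window.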

Finally, unraveling the summation all the way to time zero highlights that the
process $y_t$ can be viewed as a random deviation around the predictions made at
time zero, $\hat{y}_t := y_{t|0}$, which are specified externally to the model:

(12) $y_t = \hat{y}_t + \sum_{s=1}^{t} f(t-s)\, e(s)$.

This highlights that an instance of the online convex optimization problem can
be specified via either the process $y_t$ or via the initial predictions $\hat{y}_t$, and then
the random noise from the model determines the other. We discuss this more when
defining the notions of regret and competitive ratio we study in this paper in Section 3.3.


3.2. Examples. While the form of the prediction error in the model may seem
mysterious, it is quite general, and includes many of the traditional models as
special cases. For example, to recover the worst-case prediction model one can set
$e(t) = 0$ for all $t$, treat $y_{t'}$ as unknown for all $t' > t + w$, and then take the worst case over
$y$. Similarly, a common approach in robust control is to set

$f(t) = \begin{cases} I, & t = 0; \\ 0, & t \neq 0, \end{cases} \qquad |e(s)| \le D \ \ \forall s,$

and then consider the worst case over $e$.

Additionally, strong motivation for it can be obtained by studying the predictors
for common stochastic processes. In particular, the form of (7) matches the
prediction error of standard Wiener filters [44] and Kalman filters [24], etc. To highlight
this, we include a few brief examples below.


Example 1 (Wiener Filter). Let $\{y_t\}_{t=0}^{T}$ be a wide-sense stationary stochastic
process with $\mathbb{E}[y_t] = \hat{y}_t$, and covariance $\mathbb{E}[(y_i - \hat{y}_i)(y_j - \hat{y}_j)^T] = R_y(i - j)$, i.e., the
covariance matrix $R_y > 0$ of $y = [y_1\ y_2\ \ldots\ y_T]^T$ is a Toeplitz matrix. The
corresponding $e(s)$ in the Wiener filter for the process is called the “innovation process”


MANGJUN CHEN, ANISH AGARWAL, ADAM WIERMAN, SIDDHARTH BARMAN, LACHLAN L. H. ANDREW 


and can be computed via the Wiener-Hopf technique [23]. Using the innovation
process $e(s)$, the optimal causal linear prediction is

$y_{t|\tau} = \hat{y}_t + \sum_{s=1}^{\tau} \langle y_t, e(s) \rangle \|e(s)\|^{-2} e(s)$,

and so the correlation function $f(s)$ as defined in (7) is

(13) $f(s) = \langle y_s, e(0) \rangle \|e(0)\|^{-2} = R_y(s) R_e^{-1}$,

which yields

$\|f_w\|^2 = \frac{1}{R_e} \sum_{s=0}^{w} R_y(s)^2$ and $F(w) = \frac{1}{R_e} \sum_{s=0}^{w} (w - s + 1) R_y(s)^2$.
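
As a consistency check on this example, assume a scalar process, so that $R_e$ is a positive number and $K K^\dagger = 1$. Substituting the correlation function (13) into the definitions (10) and (11) then gives the following; the scalar restriction is an assumption made here for illustration.

```latex
\begin{align*}
\|f_w\|^2 &= R_e \sum_{s=0}^{w} f(s)^2
           = R_e \sum_{s=0}^{w} \frac{R_y(s)^2}{R_e^2}
           = \frac{1}{R_e} \sum_{s=0}^{w} R_y(s)^2, \\
F(w) &= R_e \sum_{s=0}^{w} (w - s + 1)\, f(s)^2
      = \frac{1}{R_e} \sum_{s=0}^{w} (w - s + 1)\, R_y(s)^2 .
\end{align*}
```

Both quantities are therefore determined by how quickly the autocovariance $R_y(s)$ decays with the lag $s$.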

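Example 2 (Kalman Filter). Consider a stationary dynamical system described
by the hidden state space model

$x_{t+1} = A x_t + B u_t, \qquad y_t = C x_t + v_t$,

where the $\{u_t, v_t, x_0\}$ are $m \times 1$, $p \times 1$, and $n \times 1$-dimensional random variables
such that

$\left\langle \begin{bmatrix} u_i \\ v_i \\ x_0 \end{bmatrix}, \begin{bmatrix} u_j \\ v_j \\ x_0 \\ 1 \end{bmatrix} \right\rangle = \begin{bmatrix} Q\delta_{ij} & S\delta_{ij} & 0 & 0 \\ S^*\delta_{ij} & R\delta_{ij} & 0 & 0 \\ 0 & 0 & \Pi_0 & 0 \end{bmatrix}$.

The Kalman filter for this process yields the optimal causal linear estimator $y_{t|\tau}$,
i.e., the linear function of the observations available up to time $\tau$ that minimizes
the mean squared error $\mathbb{E}\|y_t - y_{t|\tau}\|^2$. When $t$ is large
and the system reaches steady state, the optimal prediction is given in the following
recursive form [23]:

$x_{t+1|t} = A x_{t|t-1} + K_p e(t), \qquad y_{0|-1} = 0, \qquad e(0) = y_0$,
$e(t) = y_t - C x_{t|t-1}$,

where $K_p = (APC^* + BS) R_e^{-1}$ is the Kalman gain, $R_e = R + CPC^*$ is the
covariance of the innovation process $e_t$, and $P$ solves

$P = APA^* + BQB^* - K_p R_e K_p^*$.

This yields the predictions

$y_{t|\tau} = \sum_{s=1}^{\tau} \langle y_t, e(s) \rangle R_e^{-1} e(s) = \sum_{s=1}^{\tau} C A^{t-s-1} (APC^* + BS) R_e^{-1} e(s)$.

Thus, for the stationary Kalman filter, the prediction error correlation function is

(14) $f(s) = C A^{s-1} (APC^* + BS) R_e^{-1} = C A^{s-1} K_p$,

which yields

$\|f_w\|^2 = \sum_{s=0}^{w} \mathrm{tr}\left(R_e (C A^{s-1} K_p)^T (C A^{s-1} K_p)\right)$ and

$F(w) = \sum_{s=0}^{w} (w - s + 1)\, \mathrm{tr}\left(R_e (C A^{s-1} K_p)^T K K^\dagger (C A^{s-1} K_p)\right)$.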
3.3. Performance Metrics. A key feature of the prediction model described
above is that it provides a general stochastic model for prediction errors while
not imposing any particular underlying stochastic process or predictor. Thus, it
generalizes a variety of stochastic models while allowing worst-case analysis.

More specifically, when studying online algorithms using the prediction model
above, one could either specify the instance via $y_t$ and then use the form of (7) to
give random predictions about the instance to the algorithm, or one could specify
the instance using $\hat{y} := y_{\cdot|0}$ and then let the $y_t$ be randomly revealed using the form
of (12). Note that, of the two interpretations, the second is preferable for analysis,
and thus we state our theorems using it.

In particular, our setup can be interpreted as allowing an adversary to specify
the instance via the initial (time 0) predictions $\hat{y}$, and then using the prediction
error model to determine the instance $y_t$. We then take the worst-case over $\hat{y}$. This
corresponds to having an adversary with a “shaky hand” or, alternatively, letting
the adversary specify the instance but forcing them to also provide unbiased initial
predictions.

In this context, we study the following notions of (expected) regret and
(expected) competitive ratio, where the expectation is over the realization of the
prediction noise $e$ and the measures consider the worst-case specification of the instance
$\hat{y}$.


Definition 3. We say an online algorithm ALG has (expected) regret at most
$\rho(T)$ if

(15)    $\sup_{\hat{y}} \mathbb{E}_e[\mathrm{cost}(ALG) - \mathrm{cost}(STA)] \le \rho(T).$


Definition 4. We say an online algorithm ALG is $\rho(T)$-competitive (in expectation) if

(16)    $\sup_{\hat{y}} \frac{\mathbb{E}_e[\mathrm{cost}(ALG)]}{\mathbb{E}_e[\mathrm{cost}(OPT)]} \le \rho(T).$


Our proofs bound the competitive ratio through an analysis of the competitive 
difference, which is defined as follows. 


Definition 5. We say an online algorithm ALG has (expected) competitive difference
at most $\rho(T)$ if

(17)    $\sup_{\hat{y}} \mathbb{E}_e[\mathrm{cost}(ALG) - \mathrm{cost}(OPT)] \le \rho(T).$

Note that these expectations are with respect to the prediction noise, $\{e(t)\}_{t=1}^{T}$,
and so cost(OPT) is also random. Note also that when $\mathrm{cost}(OPT) \in \Omega(\rho(T))$ and
ALG has competitive difference at most $\rho(T)$, then the algorithm has a constant
(bounded) competitive ratio.
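
The step from competitive difference to competitive ratio can be spelled out as follows. This is only a sketch: it assumes a constant $c > 0$ (a name introduced here, not in the text) such that $\mathbb{E}_e[\mathrm{cost}(OPT)] \ge c\,\rho(T)$ for all sufficiently large $T$ and every instance $\hat{y}$, which is one way to read the $\Omega(\rho(T))$ condition.

```latex
\begin{align*}
\frac{\mathbb{E}_e[\mathrm{cost}(ALG)]}{\mathbb{E}_e[\mathrm{cost}(OPT)]}
  &= 1 + \frac{\mathbb{E}_e[\mathrm{cost}(ALG) - \mathrm{cost}(OPT)]}{\mathbb{E}_e[\mathrm{cost}(OPT)]} \\
  &\le 1 + \frac{\rho(T)}{c\,\rho(T)} \;=\; 1 + \frac{1}{c}.
\end{align*}
```

Under that assumption the ratio is bounded by a constant that does not depend on $T$.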


4. Averaging Fixed Horizon Control 


A wide variety of algorithms have been proposed for online convex optimization 
problems. Given the focus of this paper on predictions, the most natural choice of an 
algorithm to consider is Receding Horizon Control (RHC), a.k.a. Model Predictive
Control (MPC).

There is a large literature in control theory that studies RHC/MPC algorithms,
e.g., [121133] and the references therein; and thus RHC is a popular choice for
online optimization problems when predictions are available, e.g., [g SSI Eg [n].
However, recent results have highlighted that while RHC can perform well for
one-dimensional smoothed online optimization problems, it does not perform well (in
the worst case) outside of the one-dimensional case. Specifically, the competitive ratio
of RHC with perfect lookahead $w$ is $1 + O(1/w)$ in the one-dimensional setting, but
is $1 + \Omega(1)$ outside of this setting, i.e., the competitive ratio does not decrease to 1
as the prediction window $w$ increases [28].

In contrast, a promising new algorithm, Averaging Fixed Horizon Control (AFHC),
proposed by EH] in the context of geographical load balancing, maintains good
performance in high-dimensional settings, i.e., maintains a competitive ratio of
$1 + O(1/w)$. Thus, in this paper, we focus on AFHC. Our results highlight that
AFHC extracts the asymptotically optimal value from predictions, and so validate
this choice.

As the name implies, AFHC averages the choices made by Fixed Horizon Control
(FHC) algorithms. In particular, AFHC with prediction window size $(w+1)$
averages the actions of $(w+1)$ FHC algorithms.

Algorithm 1 (Fixed Horizon Control). Let $\Omega_k = \{i : i \equiv k \bmod (w+1)\} \cap [-w, T]$
for $k = 0, \ldots, w$. Then FHC$^{(k)}(w+1)$, the $k$th FHC algorithm, is defined in the
following manner. At timeslot $\tau \in \Omega_k$ (i.e., before $c_\tau$ is revealed), choose actions
$x^{(k)}_{FHC,t}$ for $t = \tau, \ldots, \tau + w$ as follows:

If $t \le 0$, $x^{(k)}_{FHC,t} = 0$. Otherwise, let $x_{\tau-1} = x^{(k)}_{FHC,\tau-1}$, and let $(x^{(k)}_{FHC,t})_{t=\tau}^{\tau+w}$ be
the vector that solves

    $\min_{x_\tau, \ldots, x_{\tau+w}} \sum_{t=\tau}^{\tau+w} \big( \hat{c}_t(x_t) + \beta\|x_t - x_{t-1}\| \big),$

where $\hat{c}_t(\cdot)$ is the prediction of the future cost $c_t(\cdot)$ for $t = \tau, \ldots, \tau + w$.

Note that in the classical OCO with $(w+1)$-lookahead setting, $\hat{c}_t(\cdot)$ is exactly
equal to the true cost $c_t(\cdot)$. Each FHC$^{(k)}(w+1)$ can be seen as a length-$(w+1)$
fixed horizon control starting at position $k$. Given $(w+1)$ versions of FHC, AFHC
is defined as follows:

Algorithm 2 (Averaging Fixed Horizon Control). At timeslot $t \in 1, \ldots, T$, AFHC$(w+1)$
sets

(18)    $x_{AFHC,t} = \frac{1}{w+1} \sum_{k=0}^{w} x^{(k)}_{FHC,t}.$
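
Algorithms 1 and 2 can be sketched in a few lines for the one-dimensional tracking cost $\frac{1}{2}(\hat{y}_t - x_t)^2 + \beta|x_t - x_{t-1}|$. The sketch below is illustrative only: the window subproblem is solved by a dynamic program over a finite grid of candidate actions (an assumption made to avoid an external solver, not the authors' method), and `predict(t, tau)` is a hypothetical callback returning the prediction of $y_t$ available when the window starting at $\tau$ is planned.

```python
def solve_window(y_hat, x_prev, beta, grid):
    """Grid DP for min sum_t 0.5*(y_hat[t]-x_t)^2 + beta*|x_t - x_{t-1}|."""
    # cost_to[j] is the best cost of any path that currently ends at grid[j]
    cost_to = [0.5 * (y_hat[0] - g) ** 2 + beta * abs(g - x_prev) for g in grid]
    back = []  # back[step][j] = best predecessor index for grid[j]
    for y in y_hat[1:]:
        new_cost, ptr = [], []
        for g in grid:
            i = min(range(len(grid)),
                    key=lambda i: cost_to[i] + beta * abs(g - grid[i]))
            new_cost.append(cost_to[i] + beta * abs(g - grid[i]) + 0.5 * (y - g) ** 2)
            ptr.append(i)
        cost_to = new_cost
        back.append(ptr)
    j = min(range(len(grid)), key=lambda i: cost_to[i])
    path = [j]
    for ptr in reversed(back):  # walk the back pointers to recover the path
        j = ptr[j]
        path.append(j)
    path.reverse()
    return [grid[j] for j in path]


def fhc(k, w, T, predict, beta, grid):
    """Actions of the k-th FHC algorithm; x[0] = 0 is the initial action."""
    x = [0.0] * (T + 1)
    start = k - (w + 1) if k >= 1 else 0      # smallest element of Omega_k
    for tau in range(start, T + 1, w + 1):    # tau ranges over Omega_k
        first, last = max(tau, 1), min(tau + w, T)
        if first > last:                      # window lies entirely at t <= 0
            continue
        y_hat = [predict(t, tau) for t in range(first, last + 1)]
        x[first:last + 1] = solve_window(y_hat, x[first - 1], beta, grid)
    return x


def afhc(w, T, predict, beta, grid):
    """Average the w+1 FHC action sequences, as in (18)."""
    runs = [fhc(k, w, T, predict, beta, grid) for k in range(w + 1)]
    return [sum(run[t] for run in runs) / (w + 1) for t in range(T + 1)]


y = [0.0, 1.0, 1.0, 1.0, 0.0, 0.0, 1.0, 1.0]   # y[0] is an unused placeholder
T = len(y) - 1
grid = [i / 10 for i in range(11)]
x_afhc = afhc(w=2, T=T, predict=lambda t, tau: y[t], beta=0.1, grid=grid)
```

The example uses perfect predictions (`predict` simply returns `y[t]`); a noisy predictor can be plugged in without changing the averaging step.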

5. Average-case Analysis 

We first consider the average-case performance of AFHC (in this section), and 
then consider distributional analysis (in Section [6]). We focus on the tracking prob¬ 
lem in (EJ for concreteness and conciseness, though our proof techniques generalize. 
Note that, unless otherwise specified, we use $\|\cdot\| = \|\cdot\|_2$.

Our main result shows that AFHC can simultaneously achieve sublinear regret
and a constant competitive ratio using only a constant-sized prediction window in
nearly all cases that it is feasible for an online algorithm to do so. This is in stark
contrast with Theorem 1 for the worst-case prediction model.

^Note that this result assumes that the action set is bounded, i.e., for all feasible actions $x, y$, there
exists $D > 0$ such that $\|x - y\| < D$, and that there exists $e_0 > 0$ such that $c_t(0) \ge e_0$ for all $t$. The results we
prove in this paper make neither of these assumptions.

Theorem 2. Let $w$ be a constant. AFHC$(w+1)$ is constant-competitive whenever
$\inf_{\hat{y}} \mathbb{E}_e[OPT] = \Omega(T)$ and has sublinear regret whenever $\inf_{\hat{y}} \mathbb{E}_e[STA] \ge \alpha_1 T - o(T)$,
for $\alpha_1 = 4V + 8B^2$, where




(19) 

( 20 ) 


w -I- 1 




and $\|M\|_1$ denotes the induced 1-norm of a matrix $M$.

Theorem 2 imposes bounds on the expected costs of the dynamic and static
optimal in order to guarantee a constant competitive ratio and sublinear regret.
These bounds come about as a result of the noise in predictions. In particular,
prediction noise makes it impossible for an online algorithm to achieve sublinear
expected cost, and thus makes it infeasible for an online algorithm to compete with
dynamic and static optimal solutions that perform too well. This is made formal in
Theorems 3 and 4, which are proven in Appendix B. Recall that $R_e$ is the covariance
of an estimation error vector, $e(t)$.

Theorem 3. Any online algorithm ALG that chooses $x_t$ using only (i) internal
randomness independent of $e(\cdot)$ and (ii) predictions made up until time $t$, has expected
cost $\mathbb{E}_e[\mathrm{cost}(ALG)] \ge \alpha_2 T + o(T)$, where $\alpha_2 = \frac{1}{2}\|R_e^{1/2}\|^2_{KK^\dagger}$.

Theorem 4. Consider an online algorithm ALG such that $\mathbb{E}_e[\mathrm{cost}(ALG)] \in o(T)$.
The actions, $x_t$, of ALG can be used to produce one-step predictions $y'_{t|t-1}$, such
that the mean square of the one-step prediction error is smaller than that for $\hat{y}_{t|t-1}$,
i.e., $\mathbb{E}_e\|y_t - y'_{t|t-1}\|^2 < \mathbb{E}_e\|y_t - \hat{y}_{t|t-1}\|^2$, for all but sublinearly many $t$.

Theorem 3 implies that it is impossible for any online algorithm that uses extra
information (e.g., randomness) independent of the prediction noise to be constant
competitive if $\mathbb{E}_e[\mathrm{cost}(OPT)] = o(T)$ or to have sublinear regret if $\mathbb{E}_e[\mathrm{cost}(STA)] \le
(\alpha_2 - \varepsilon)T + o(T)$, for $\varepsilon > 0$.

Further, Theorem 4 states that if an online algorithm does somehow obtain
asymptotically smaller cost than possible using only randomness independent of
the prediction error, then it must be using more information about future $y_t$ than
is available from the predictions. This means that the algorithm can be used to
build a better predictor.

Thus, the consequence of Theorems 3 and 4 is the observation that the condition
in Theorem 2 for the competitive ratio is tight and the condition in Theorem 2 for
regret is tight up to a constant factor, i.e., $\alpha_1$ versus $\alpha_2$. (Attempting to prove
matching bounds here is an interesting, but very challenging, open question.)

In the remainder of the section, we outline the analysis needed to obtain Theorem
2, which is proven by combining Theorem 5, bounding the competitive difference
of AFHC, and Theorem 9, bounding the regret of AFHC. The analysis exposes the
importance of the correlation in prediction errors for tasks such as determining
the optimal prediction window size for AFHC. Specifically, the window size that
minimizes the performance bounds we derive is determined not by the quality of
predictions, but rather by how quickly error correlates, i.e., by $\|f_w\|^2$.

Proof of Theorem 2. The first step in our proof of Theorem 2 is to bound the
competitive difference of AFHC. This immediately yields a bound on the competitive
ratio and, since it is additive, it can easily be adapted to bound regret as
well.

The main result in our analysis of competitive difference is the following. This
is the key both to bounding the competitive ratio and regret.

Theorem 5. The competitive difference of AFHC$(w+1)$ is $O(T)$ and bounded by:

(21)    $\sup_{\hat{y}} \mathbb{E}_e[\mathrm{cost}(AFHC) - \mathrm{cost}(OPT)] \le VT,$

where $V$ is given by (19).

Theorem 5 implies that the competitive ratio of AFHC is bounded by a constant
when $\mathrm{cost}(OPT) \in \Omega(T)$.

The following corollary of Theorem 5 is obtained by minimizing $V$ with respect
to $w$.

Corollary 6. For AFHC, the prediction window size that minimizes the bound
in Theorem 5 on competitive difference is a finite constant (independent of $T$) if
$F(T) \in \omega(T)$ and is $T$ if there is i.i.d. noise.

The intuition behind this result is that if the prediction model causes noise to 
correlate rapidly, then a prediction for a time step too far into the future will be 
so noisy that it would be best to ignore it when choosing an action under AFHC. 
However, if the prediction model is nearly independent, then it is optimal for AFHC 
to look over the entire time horizon, T, since there is little risk from aggregating 
predictions. Importantly, notice that the quality (variance) of the predictions is
not the determining factor; only the correlation is.

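Theorem 5 is proven using the following lemma (proven in the appendix) by
taking expectation over noise.

Lemma 7. The cost of AFHC$(w+1)$ for any realization of $y_t$ satisfies

    $\mathrm{cost}(AFHC) - \mathrm{cost}(OPT) \le \frac{1}{w+1} \sum_{k=0}^{w} \sum_{\tau \in \Omega_k} \Big( \beta \|x^*_{\tau-1} - x^{(k)}_{\tau-1}\|_1 + \sum_{t=\tau}^{\tau+w} \frac{1}{2} \|y_t - \hat{y}_{t|\tau-1}\|^2_{KK^\dagger} \Big).$
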

Next, we use the analysis of the competitive difference in order to characterize 
the regret of AFHC. In particular, to bound the regret we simply need a bound on 
the gap between the dynamic and static optimal solutions. 

Lemma 8. The suboptimality of the offline static optimal solution STA can be
bounded below on each sample path by


cost(S'TA) — cost(OPT) 



where \ 



^Specifically, $f(0) = I$ and $f(t) = 0$ for all $t > 0$.

Note that the bound above is in terms of $\|(y_t - \bar{y})\|_{KK^\dagger}$, which can be interpreted
as a measure of the variability of $y_t$. Specifically, it is the projection of the variation
onto the range space of $K$.
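
As a small numerical illustration of this projection, the sketch below takes $K$ to be a single column vector (an assumption made only to keep the example free of external libraries), so that $KK^\dagger y$ is the orthogonal projection of $y$ onto the span of that column. It also checks the decomposition $\|y - Kx\|^2 = \|(I - KK^\dagger)y\|^2 + \|KK^\dagger y - Kx\|^2$, which follows from the identity $(I - KK^\dagger)K = 0$ used in the appendix. All names and values are illustrative.

```python
def dot(u, v):
    """Euclidean inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


def project_onto_range(k, y):
    """Return K K^dagger y when K consists of the single column k."""
    scale = dot(k, y) / dot(k, k)
    return [scale * ki for ki in k]


k = [1.0, 2.0, 2.0]    # the single column of K (hypothetical values)
y = [3.0, -1.0, 4.0]   # an arbitrary target vector
x = 0.7                # an arbitrary scalar action

proj = project_onto_range(k, y)                 # K K^dagger y
resid = [yi - pi for yi, pi in zip(y, proj)]    # (I - K K^dagger) y
kx = [ki * x for ki in k]                       # K x

lhs = sum((yi - ci) ** 2 for yi, ci in zip(y, kx))
rhs = dot(resid, resid) + sum((pi - ci) ** 2 for pi, ci in zip(proj, kx))
```

Only the projected part of $y$ can be tracked by any action; the residual term is a cost that no choice of $x$ can reduce.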

Combining Theorem 5 with Lemma 8 gives a bound on the regret of AFHC,
proven in Appendix B.

Theorem 9. AFHC has sublinear expected regret if

    $\inf_{\hat{y}} \mathbb{E}_e \sum_{t=1}^{T} \|KK^\dagger(y_t - \bar{y})\|^2 \ge (8V + 16B^2)T,$

where $V$ and $B$ are defined in (19) and (20).

Finally, we make the observation that, for all instances of y: 

1 ^ 

cost (5rA) = - ^ I |yt - AT 11' +/3| 1x11 1 
1 

> 

^ t=i 
T 

^ t=i 

Hence by Theorem |i we have the condition of the Theorem. 

6. Concentration Bounds 

The previous section shows that AFHC performs well in expectation, but it is
also important to understand the distribution of the cost under AFHC. In this
section, we show that, with a mild additional assumption on the prediction error
$e(t)$, the probability of a large deviation from the expected performance bound
proven in Theorem 5 decays exponentially fast.

The intuitive idea behind the result is the observation that the competitive
difference of AFHC is a function of the uncorrelated prediction errors $e(1), \ldots, e(T)$
that does not put too much emphasis on any one of the random variables $e(t)$.
This type of function normally has sharp concentration around its mean because
the effect of each $e(t)$ tends to cancel out.

For simplicity of presentation, we state and prove the concentration result for
AFHC for the one-dimensional tracking cost function

    $\sum_{t=1}^{T} \left( \frac{1}{2}(y_t - x_t)^2 + \beta|x_t - x_{t-1}| \right).$

In this case, $R_e = \sigma^2$, and the correlation function $f : \mathbb{N} \to \mathbb{R}$ is a scalar-valued
function. The results can all be generalized to the multidimensional setting.

Additionally, for simplicity of presentation, we assume (for this section only)
that $\{e(t)\}_{t=1}^{T}$ are uniformly bounded, i.e., $\exists \epsilon > 0$ such that $\forall t$, $|e(t)| < \epsilon$. Note that,
with additional effort, the boundedness assumption can be relaxed to the case of
$e(t)$ being subgaussian, i.e., $\mathbb{E}[\exp(e(t)^2/\epsilon^2)] \le 2$, for some $\epsilon > 0$.

^This involves more computation and worse constants in the concentration bounds. Interested
readers are referred to Theorem 12 and the following remark of for a way to generalize the

To state the theorem formally, let $VT$ be the upper bound of the expected
competitive difference of AFHC in (21). Given $\{\hat{y}_t\}_{t=1}^{T}$, the competitive difference
of AFHC is a random variable that is a function of the prediction error $e(t)$. The
following theorem shows that the probability that the cost of AFHC exceeds that
of OPT by much more than the expected value $VT$ decays rapidly.


Theorem 10. The probability that the competitive difference of AFHC exceeds
$VT$ is exponentially small, i.e., for any $u > 0$:


{cost{AFHC) - cost (OPT) > VT + u) 
( ..2 \ 


< exp 


<2 exp 


0,2 Ilf ||2 

(™ + l)a2 N/“N 

„,2 


+ exp 


16e2A(2^P(u;) + u) 


+ buJ ’ 

where H/tulP = parameter X of concentration 


w 

A < '^{w - f)f{tf = 




and a = 8e^[T/(tc + 1)] max(^| |/u,|p, 4:XF{w)), b = 16e^A. 


The theorem implies that the tail of the competitive difference of AFHC has a
Bernstein-type bound. The bound decays much faster than the normal large deviation
bounds obtained by bounding moments, i.e., Markov's inequality or Chebyshev's
inequality. This is done by a more detailed analysis of the structure of the competitive
difference of AFHC as a function of $e = (e(1), \ldots, e(T))^T$.

Note that smaller values of $a$ and $b$ in Theorem 10 imply a sharper tail bound.
We can see that smaller $\|f_w\|$ and smaller $F(w)$ imply that the tail bound decays faster.
Since higher prediction error correlation implies higher $\|f_w\|$ and $F(w)$, Theorem
10 quantifies the intuitive idea that the performance of AFHC concentrates more
tightly around its mean when the prediction error is less correlated.


Proof of Theorem 10. To prove Theorem 10, we start by decomposing the bound
in Lemma 7. In particular, Lemma 7 gives

(22)    $\mathrm{cost}(AFHC) - \mathrm{cost}(OPT) \le g_1 + g_2,$

where

    $g_1 = \frac{1}{w+1} \sum_{k=0}^{w} \sum_{\tau \in \Omega_k} \beta\, |x^*_{\tau-1} - x^{(k)}_{\tau-1}|$

represents the loss due to the switching cost, and

    $g_2 = \frac{1}{w+1} \sum_{k=0}^{w} \sum_{\tau \in \Omega_k} \sum_{t=\tau}^{\tau+w} \frac{1}{2} (y_t - \hat{y}_{t|\tau-1})^2$

represents the loss due to the prediction error.


concentration bound for the switching cost (Lemma 11), and Theorem 1.1 of [37] for a way to
generalize the concentration bound for prediction error (Lemma 15).

Let $V_1 = \frac{3\beta^2 T}{w+1} + \frac{\beta T}{w+1}\|f_w\|_2$ and $V_2 = VT - V_1$, so that $VT = V_1 + V_2$.
Then, by (22),

    $\mathbb{P}(\mathrm{cost}(AFHC) - \mathrm{cost}(OPT) > u + VT)$
        $\le \mathbb{P}(g_1 > u/2 + V_1 \text{ or } g_2 > u/2 + V_2)$
(23)    $\le \mathbb{P}(g_1 > u/2 + V_1) + \mathbb{P}(g_2 > u/2 + V_2).$

Thus, it suffices to prove concentration bounds for the loss due to switching cost,
$g_1$, and the loss due to prediction error, $g_2$, deviating from $V_1$ and $V_2$ respectively.
This is done in the following. The idea is to first prove that $g_1$ and $g_2$ are functions
of $e = (e(1), \ldots, e(T))^T$ that are not “too sensitive” to any of the elements of $e$, and
then apply the method of bounded difference [34] and the Log-Sobolev inequality [27].
Combining with Lemmas 11 and 15 below will complete the proof of Theorem 10.

Bounding the loss due to switching cost. This section establishes the following 
bound on the loss due to switching: 

Lemma 11. The loss due to switching cost has a sub-Gaussian tail: for any u > 0, 

(24) 

To prove Lemma 11, we introduce two lemmas. Firstly, we use the first order
optimality condition to bound $g_1$ above by a linear function of $e = (e(1), \ldots, e(T))^T$
using the following lemma proved in the Appendix.


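Lemma 12. The loss due to switching cost can be bounded above by

(25)    $g_1 \le \frac{3\beta^2 T}{w+1} + \frac{\beta}{w+1} \sum_{k=0}^{w} \sum_{\tau \in \Omega_k} \left| \sum_{s = 1 \vee (\tau - w - 2)}^{\tau - 1} f(\tau - 1 - s)\, e(s) \right|.$

Let $g_1'(e)$ be the second term on the right-hand side of (25). Note that the only randomness in the upper
bound (25) comes from $g_1'$.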


Lemma 13. The expectation of $g_1'(e)$ is bounded above by

    $\mathbb{E}_e\, g_1'(e) \le \frac{\beta T}{w+1} \|f_w\|_2.$

With Lemma 13, we can reduce (24) to proving a concentration bound on $g_1'(e)$,
since

(26)    $\mathbb{P}(g_1 > u + V_1) \le \mathbb{P}(g_1' - \mathbb{E}\, g_1'(e) > u).$

To prove concentration of $g_1'(e)$, which is a function of a collection of independent
random variables, we use the method of bounded difference, i.e., we bound the
difference of $g_1'(e)$ where one component of $e$ is replaced by an identically distributed
copy. Specifically, we use the following lemma, the one-sided version of one due to
McDiarmid:


Lemma 14 ([34], Lemma 1.2). Let $X = (X_1, \ldots, X_n)$ be independent random
variables and $Y$ be the random variable $f(X_1, \ldots, X_n)$, where the function $f$ satisfies

    $|f(x) - f(x'_k)| \le c_k$

whenever $x$ and $x'_k$ differ in the $k$th coordinate. Then for any $t > 0$,

    $\mathbb{P}(Y - \mathbb{E}Y > t) \le \exp\left( -\frac{2t^2}{\sum_{k=1}^{n} c_k^2} \right).$

formed by replacing e{k) with an independent and identically distributed copy e'{k). 


Then 



m—Q 



Hence 



By Lemma 14,



Substituting this into (26) finishes the proof.


□ 


Bounding the loss due to prediction error. In this section we prove the following 
concentration result for the loss due to correlated prediction error. 

Lemma 15. The loss due to prediction error has a Bernstein-type tail: for any $u > 0$,



(27) 


To prove Lemma 15, we characterize $g_2$ as a convex function of $e$ in Lemma 16.
We then show that this is a self-bounding function. Combining convexity and the
self-bounding property of $g_2$, Lemma 17 makes use of the convex Log-Sobolev inequality
to prove concentration of $g_2$.

Lemma 16. The expectation of $g_2$ is $\mathbb{E}\, g_2 = V_2$, and $g_2$ is a convex quadratic
form of $e$. Specifically, there exists a matrix $A \in \mathbb{R}^{T \times T}$ such that $g_2 = \frac{1}{2}\|Ae\|^2$.
Furthermore, the spectral radius $\lambda$ of $AA^T$ satisfies $\lambda \le F(w)$.

Hence, (27) is equivalent to a concentration result for $g_2$:

    $\mathbb{P}(g_2 > V_2 + u) = \mathbb{P}(g_2 - \mathbb{E}\, g_2 > u).$

The method of bounded difference used in the previous section is not good for a
quadratic function of $e$ because the uniform bound of $|g_2(e) - g_2(e'_k)|$ is too large,
since

    $|g_2(e) - g_2(e'_k)| = \frac{1}{2}\,|(e - e'(k))^T A^T A\, (e + e'(k))|,$

where the $(e + e'(k))$ term has $T$ non-zero entries and a uniform upper bound of this
will be in $\Omega(T)$. Instead, we use the fact that the quadratic form is self-bounding.
Let $h(e) = g_2(e) - \mathbb{E}\, g_2(e)$. Then

    $\|\nabla h(e)\|^2 = \|A^T A e\|^2 = (Ae)^T (AA^T)(Ae) \le \lambda (Ae)^T (Ae) = 2\lambda[h(e) + V_2].$

We now introduce the concentration bound for a self-bounding function of a collection
of random variables. The proof uses the convex Log-Sobolev inequality [27].


Lemma 17. Let $f : \mathbb{R}^n \to \mathbb{R}$ be convex and the random variable $X$ be supported on
$[-d/2, d/2]^n$. If $\mathbb{E}[f(X)] = 0$ and $f$ satisfies the self-bounding property

(28)    $\|\nabla f\|^2 \le a f + b,$

for $a, b > 0$, then the tail of $f(X)$ can be bounded as

(29)    $\mathbb{P}\{f(X) > t\} \le \exp\left( -\frac{t^2}{d^2(2b + at)} \right).$

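Now to complete the proof of Lemma 15, apply Lemma 17 to the random variable
$Z = h(e)$ to obtain

    $\mathbb{P}\{g_2 - \mathbb{E}\, g_2 > u\} \le \exp\left( -\frac{u^2}{8\lambda\epsilon^2(2V_2 + u)} \right)$

for $u > 0$, i.e.,

    $\mathbb{P}\{g_2 > u + V_2\} \le \exp\left( -\frac{u^2}{8\lambda\epsilon^2(2V_2 + u)} \right)$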


= exp 


+u) ^ 


7. Concluding Remarks 

Making use of predictions about the future is a crucial, but under-explored, 
area of online algorithms. In this paper, we have introduced a general colored
noise model for studying predictions. This model captures a range of important
phenomena for prediction errors, including general correlation structures, prediction
noise that increases with the prediction horizon, and refinement of predictions as
time passes. Further, it allows for worst-case analysis of online algorithms in the
context of stochastic prediction errors.

To illustrate the insights that can be gained from incorporating a general model 
of prediction noise into online algorithms, we have focused on online optimization 
problems with switching costs, specifically, an online LASSO formulation. Our 
results highlight that a simple online algorithm, AFHC, can simultaneously achieve 
a constant competitive ratio and a sublinear regret in expectation in nearly any 
situation where it is feasible for an online algorithm to do so. Further, we show 
that the cost of AFHC is tightly concentrated around its mean. 

We view this paper as a first step toward understanding the role of predictions
in the design of online optimization algorithms and, more generally, the design of
online algorithms. In particular, while we have focused on a particular, promising
algorithm, AFHC, it is quite interesting to ask if it is possible to design online
algorithms that outperform AFHC. We have proven that AFHC uses the asymptotically
minimal amount of predictions to achieve constant competitive ratio and
sublinear regret; however, the cost of other algorithms may be lower if they can use
the predictions more efficiently.

In addition to studying the performance of algorithms other than AFHC, it would
also be interesting to generalize the prediction model further, e.g., by considering
non-stationary processes or heterogeneous $e(t)$.

Appendix A. Proofs for Section O

A.1. Proof of Theorem 1. For a contradiction, assume that there exists an algorithm
A' that achieves constant competitive ratio and sublinear regret with constant
lookahead. We can use algorithm A' to obtain another online algorithm A
that achieves constant competitive ratio and sublinear regret without lookahead.
This contradicts Theorem 4 of [2], and we get the claim.

Consider an instance $\{c_1, c_2, \ldots, c_T\}$ without lookahead. We simply “pad” the
input with $l$ copies of the zero function 0 if A' has a lookahead of $l$. That is, the
input to A' is: $c_1, 0, \ldots, 0, c_2, 0, \ldots, 0, c_3, 0, \ldots$

We simulate A' and set the $t$th action of A equal to the $((t-1)(l+1)+1)$th
action of A'. Note that the optimal values of the padded instance are equal to the
optimal values of the given instance. Also, by construction, $\mathrm{cost}(A) \le \mathrm{cost}(A')$.
Therefore, if A' achieves constant competitive ratio and sublinear regret then so
does A, and the claim follows.
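
The padding construction can be made concrete with a short sketch. The helper names below are hypothetical, and cost functions are represented as plain Python callables purely for illustration. The point is that when A' looks ahead $l$ steps from the position of an original cost function, it only sees zero functions, which reveal nothing about the future of the original instance.

```python
def pad_instance(costs, lookahead):
    """Insert `lookahead` zero functions after each original cost function."""
    def zero(x):
        return 0.0
    padded = []
    for c in costs:
        padded.append(c)
        padded.extend([zero] * lookahead)
    return padded


def padded_index(t, lookahead):
    """1-based position in the padded instance of the t-th original cost."""
    return (t - 1) * (lookahead + 1) + 1


# Three illustrative quadratic costs c_t(x) = (x - a_t)^2
costs = [lambda x, a=a: (x - a) ** 2 for a in (1.0, 2.0, 3.0)]
padded = pad_instance(costs, lookahead=2)
positions = [padded_index(t, 2) for t in (1, 2, 3)]
```

With lookahead 2, the three original costs sit at positions 1, 4 and 7 of the padded instance, and A copies the actions that A' takes at exactly those positions.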


Appendix B. Proofs for Section 5

B.1. Proof of Theorem 3.

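Proof. Let $(x_{ALG,t})_{t=1}^{T}$ be the solution produced by online algorithm ALG. Then

    $\mathrm{cost}(ALG) \ge \frac{1}{2}\sum_{t=1}^{T} \|y_t - K x_{ALG,t}\|^2$
    $= \frac{1}{2}\sum_{t=1}^{T} \left( \|(I - KK^\dagger) y_t\|^2 + \|KK^\dagger y_t - K x_{ALG,t}\|^2 \right)$

by the identity $(I - KK^\dagger)K = 0$. Let $\epsilon_t = x_{ALG,t} - K^\dagger \hat{y}_{t|t-1}$, i.e., $\epsilon_t =
x_{ALG,t} - K^\dagger\big(\hat{y}_{t|0} + \sum_{s=1}^{t-1} f(t-s)e(s)\big)$. Since all predictions made up until $t$ can be
expressed in terms of $\hat{y}_{\cdot|0}$ and $e(\tau)$ for $\tau < t$, which are independent of $e(t)$, and all
other information (internal randomness) available to ALG is independent of $e(t)$
by assumption, $\epsilon_t$ is independent of $e(t)$. It follows that

(30)    $\mathbb{E}_e[\mathrm{cost}(ALG)] \ge \mathbb{E}_e\, \frac{1}{2}\sum_{t=1}^{T} \|KK^\dagger y_t - K(K^\dagger \hat{y}_{t|t-1} + \epsilon_t)\|^2$
        $= \frac{1}{2}\sum_{t=1}^{T} \mathbb{E}_e \|KK^\dagger e(t) - K\epsilon_t\|^2$
        $= \frac{1}{2}\sum_{t=1}^{T} \left( \|R_e^{1/2}\|^2_{KK^\dagger} + \|(\mathbb{E}\,\epsilon_t \epsilon_t^T)^{1/2}\|^2_{K^T K} \right)$
        $\ge \frac{T}{2} \|R_e^{1/2}\|^2_{KK^\dagger},$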


where the first equality uses the identity $(I - KK^\dagger)K = 0$, and the second uses
the independence of $\epsilon_t$ and $e(t)$. □

B.2. Proof of Theorem 4. By Theorem 3, if $\mathbb{E}[\mathrm{cost}(ALG)] \in o(T)$, there must
be some $t$ such that $\epsilon_t$ is not independent of $e(t)$. By expanding the square term in
(30) and noting it is nonnegative:

neitfKet] < 

Each nonzero ¥,[e(t)’^Ket] can at most make one term in (l30l) zero, since there 
are T terms in (l30l) that are each lower bounded by and by assumption 

E[cost(ALG)] is sublinear. There can be at most a sublinear number of t such that 
E[e{YKet] = 0. 

For every other t, we must have E[e(t)^itret] > 0. Let k = Ket] > 0, and 

at = WiEetefY^W^j^j^^ p > 0 . 

Hence, at time t, the algorithm can produce prediction y't^t-i ~ 
where the coefficient Wt is chosen later. Then the one step prediction error variance: 

Wt 

||7-)1/2||2 ^7 I 1 

— F-H-oO-t* 

Wt wf 

Pick any wt > at/2lt. Then E|| 2 /t - = ^Wvt - 2/t|t-ilP- Hence 

ALG can produce better one-step prediction for all but sublinearly many t. 

B.3. Proof of Lemma 7. To prove Lemma 7, we use the following lemma.

Lemma 18. The competitive difference of FHC with fixed $(w+1)$-lookahead for
any realization is given by

    $\mathrm{cost}(FHC^{(k)}) \le \mathrm{cost}(OPT) + \sum_{\tau \in \Omega_k} \Big( \beta\|x^*_{\tau-1} - x^{(k)}_{\tau-1}\|_1 + \sum_{t=\tau}^{\tau+w} \frac{1}{2}\|y_t - \hat{y}_{t|\tau-1}\|^2_{KK^\dagger} \Big),$

where $x^*_t$ is the action chosen by the dynamic offline optimal.

Proof of Lemma 7. Note that $\mathrm{cost}(FHC^{(k)})$ is convex. The result then follows with a
straightforward application of Jensen’s inequality to Lemma 18. By the definition
of AFHC, we have the following inequality:

    $\mathrm{cost}(AFHC) \le \frac{1}{w+1} \sum_{k=0}^{w} \mathrm{cost}(FHC^{(k)}).$

By substituting the expression for $\mathrm{cost}(FHC^{(k)})$ into the equation above and
simplifying, we get the desired result. □

Before we prove Lemma 18, we first introduce a new algorithm we term OPEN.
This algorithm runs an open loop control over the entire time horizon, $T$. Specifically,
it chooses actions $x_t$, for $t \in 1, \ldots, T$, that solve the following optimization
problem:

    $\min_x \sum_{t=1}^{T} \frac{1}{2}\|\hat{y}_t - Kx_t\|_2^2 + \beta\|x_t - x_{t-1}\|_1.$

FHC$^{(k)}$ can be seen as starting at $x^{(k)}_{FHC,\tau-1}$, using prediction $\hat{y}_{\cdot|\tau-1}$, and running
OPEN from $\tau$ to $\tau + w$, then repeating with updated prediction $\hat{y}_{\cdot|\tau+w}$. We
first prove the following lemma characterizing the performance of OPEN.

Lemma 19. The competitive difference of OPEN over a time horizon $T$ is given by

    $\mathrm{cost}(OPEN) - \mathrm{cost}(OPT) \le \sum_{t=1}^{T} \frac{1}{2}\|y_t - \hat{y}_t\|^2_{KK^\dagger}.$

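Proof. Recall that the specific OCO we are studying is

(31)    $\min_x \sum_{t=1}^{T} \frac{1}{2}\|y_t - Kx_t\|_2^2 + \beta\|x_t - x_{t-1}\|_1,$

where $x_t \in \mathbb{R}^n$, $y_t \in \mathbb{R}^m$, $K \in \mathbb{R}^{m \times n}$, and the switching cost $\beta \in \mathbb{R}_+$.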

We first derive the dual of (1511) by linearizing the li norm which leads to the 
following equivalent expression of the objective above: 

1 

min-^||yt -Kxt\\^ + Zt 

x,z A 

S.t. Zt>Xt—Xt-l,Zt>Xt-l—Xt, Vt. 

Hence the Lagrangian is 

L{x,z\\,X) = 2 111/* “ Kxt\\^ + (At -Xt,xt - xt-i) 

^ t=l 

+ {Pl — (At + Aj), zt). 
where we take At+i = 0 and xq = 0. 

Let \t = \t — A( and wt = At + Af Dual feasibility requires Wt < /31,Vt, which 
implies — /31 < At < /31,Vt. Dual feasibility also requires (/31 — Wt,Zt) = 0,V<. 

Now by defining st = Xt — At+i and equating the derivative with respect to Xt 
to zero, the primal and dual optimal Xt,sl must satisfy K^Kx^ = K"^yt — Sj. 

Note by premultiplying the equation above by a;*^, we have {xt,s*) = {Kxt,yt) — 
||Rra;J'|p. If instead we premultiply the same equation by {K'^ff, we have after some 
simplification that Kxt = {KK^)yt — {K'^ff s*. We can now simplify the expression 
for the optimal value of the objective by using the above two equations: 

cost(OPr) = ^ -\\yt - KxtW^ + {xl,s:) 

(32) =E ^Il2/*ll' - ^ll^^^y* - 

t=l ^ 

Observe that (15^ implies minimizes the following expression \ \KK"^yt — 

(iG^)'l'S( IP over the constraint set S = {st|st = At — At+i, —/31 < At < /31 for 1 < 
t <T, At+i = 0}. 

cost{OPEN) — cost(OPr) = p{x-, y) — p{x; y) 

=p{x; y) - p{x; y) + p{x; y) - p(+; yt)

Expanding the quadratic terms, using the property of the pseudo-inverse that
$KK^\dagger K = K$, and using the fact that $Kx^*_t = KK^\dagger y_t - (K^T)^\dagger s^*_t$, we have

cost (OPEN) — cost (OPT) 

= E 5 - WiKK^yt - {K^yst)\f) 

t = l ^ 

+ i {\\yt - yt\\^ - ll(/ - KK^){yt - yt)\\^) 

T 

< E \\\y^ - - ^ii(^ - - y^)\\^ 

t=i ^ 

= j^\\\KK\yt-yt)\\\ 

t = l 

where the first inequality is because of the characterization of following (15^ . 

□ 

Proof of Lemma 18. The proof is a straightforward application of Lemma 19. Summing
the cost of OPEN over all $\tau \in \Omega_k$ and noting that the switching cost term satisfies
the triangle inequality gives us the desired result. □

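B.4. Proof of Theorem 5. We first define the sub-optimality of the open loop
algorithm over expectation of the noise. $\mathbb{E}[\|y_t - \hat{y}_t\|^2_{KK^\dagger}]$ is the expectation of the
projection of the prediction error $t + 1$ time steps away onto the range space of $K$,
given by:

    $\mathbb{E}[\|y_t - \hat{y}_t\|^2_{KK^\dagger}] = \mathbb{E}\Big\|\sum_{s=1}^{t} KK^\dagger f(t-s)e(s)\Big\|^2$
    $= \mathbb{E}\Big[\sum_{s_1=1}^{t}\sum_{s_2=1}^{t} e(s_1)^T f(t-s_1)^T (KK^\dagger)^T (KK^\dagger) f(t-s_2) e(s_2)\Big]$
    $= \mathrm{tr}\Big(\sum_{s_1=1}^{t}\sum_{s_2=1}^{t} f(t-s_1)^T (KK^\dagger)(KK^\dagger) f(t-s_2)\, \mathbb{E}[e(s_2)e(s_1)^T]\Big)$
    $= \sum_{s=0}^{t-1} \mathrm{tr}\big(f(s)^T KK^\dagger f(s) R_e\big),$

where the last line is because $\mathbb{E}[e(s_1)e(s_2)^T] = 0$ for all $s_1 \ne s_2$, and $KK^\dagger K =
K$. Note that this implies $\|f_{t-1}\|^2 = \sum_{s=0}^{t-1} \mathrm{tr}\big(f(s)^T KK^\dagger f(s) R_e\big)$. We now write the
expected suboptimality of the open loop algorithm as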

T 

E[cost(OPPiV) - cost(OPr)] < ^ -E[\\yt - yt\\lj,,] 

t = l ^ 

T- 1 T -1 

= 2 E E 

s=0 t=s 

= 9 E(2^ - sMf{sfKK^fis)R,) = F{T - 1) 
where the first equality is by rearranging the summation.

Now we take expectation of the expression we have in Lemma 7. Taking expectation
of the second penalty term (prediction error term), we have:

^ w r-\-w . 

k—O T^Clk 't—T 

1 _ 7^ 

We now need to bound the first penalty term (switching cost term). By taking 
the subgradient with respect to Xt and by optimality we have Vt = 1,..., T 

0 € K^{Kxi - yt) + mM - 4-i)iii+ mM+i - a^niii 

^x* e [{K^K)-\K^yt - 2/31), [K^K)-\K^yt + 2/31)] 

where the implication is because the sub-gradient of the 1-norm function $\|\cdot\|_1$ lies
between $-1$ and $1$.

Similarly, since $x^{(k)}_{\tau-1}$ is the last action taken over a FHC horizon, we have that
for all $\tau \in \Omega_k$,

xi% G [(7^^7f)-'(7^^i/,_i|,_,,_2 -/31), 

+ /31)] 


Taking expectation of one of the switching cost term and upper bounding with 
triangle inequality: 

(33)    $\le \|K^\dagger\|_1 \|f_w\| + 3\beta\|(K^T K)^{-1}\mathbf{1}\|_1,$

where the first inequality is by the definition of induced norm, the second inequality
is due to concavity of the square-root function and Jensen’s inequality. Summing
(33) over $k$ and $\tau$, we have the expectation of the switching cost term. Adding
the expectation of both penalty terms (loss due to prediction error and loss due to
switching cost) together, we get the desired result.


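B.5. Proof of Lemma 8. We first characterize cost(STA):

    $\mathrm{cost}(STA) = \min_x \frac{1}{2}\sum_{t=1}^{T}\|y_t - Kx\|_2^2 + \beta \mathbf{1}^T x.$

By first order conditions, we have the optimal static solution $x = K^\dagger \bar{y} - \frac{\beta}{T}(K^T K)^{-1}\mathbf{1}$,
where $\bar{y} = \frac{1}{T}\sum_{t=1}^{T} y_t$. Substituting this into cost(STA) and simplifying, we have:

    $\mathrm{cost}(STA) = \frac{1}{2}\sum_{t=1}^{T} \left( \|(I - KK^\dagger)y_t\|_2^2 + \|KK^\dagger y_t - KK^\dagger \bar{y}\|_2^2 \right) - \frac{\beta^2 \mathbf{1}^T (K^T K)^{-1}\mathbf{1}}{2T} + \beta \mathbf{1}^T K^\dagger \bar{y}.$

Let $C = \frac{\beta^2 \mathbf{1}^T (K^T K)^{-1}\mathbf{1}}{2T}$. Subtracting cost(OPT) in (32) from the above, we have
cost(STA) − cost(OPT) equals: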

i j2{\\KK\yt - ml - WKK^ytWl + WKK^yt - {K^)Ut\\l) 

+ pi^my-c 

=5 E (i 1^"^^ -y)- \\t) + {my,pi-\i)-c 

^ t=i ^ 

^ t=i ^ 


The first equality is by expanding the square terms and noting $s_t = \lambda_t - \lambda_{t+1}$.
The last inequality is because $-\beta\mathbf{1} \le \lambda_t \le \beta\mathbf{1}$ and $\beta\mathbf{1}^T K^\dagger \bar{y}$ is positive by the
assumption that the optimal static solution is positive. Now we bound the first
term of the inequality above:

^J2{\\KK\yt-y)-{K'^)U:\\l) 

^ t=l ^ 

E - y)ll') - - y). (K^Vs;) 

^ t=i ^ t=i 

>5 E - y)\\") - 2(3f2\\KKHyt - y)\\ ||(i^^)tl|| 


>5 E {W^^Hyt - y)f) - 2B pf^ ||(PPt)(y, - y )\\2 

^ t=l ^ \ t=l 

where $B = \beta\|(K^T)^\dagger \mathbf{1}\|_2$.

By subtracting $C$ from the expression above and completing the square, we have
the desired result.


B.6. Proof of Theorem 9. Using the results of Lemma 8, taking expectation and
applying Jensen’s inequality, we have:

Ee [cost(5TT) — cost (OPT)] 


>Ee[-Ell^^^(l/*-y)ll'- 25. 


\ t = l 


>i 

-2 


^^J2{\\KK^yt-yW)-2BVT^ -2B^T-C. 


c] 


Hence by Theorem 5, the regret of AFHC is

sup (Ee [cost (APPC)—cost (OPT) + cost(OPr)—cost(5TA)]) 
y 


< VT+ 2P^r+ C- - inf( 

2 y \ 


EeEll(2/‘-y)IIEt -25Vr 


Let $S(T) = \mathbb{E}_e \sum_{t=1}^{T} \|y_t - \bar{y}\|^2_{KK^\dagger}$. By the above, to prove AFHC has sublinear
regret, it is sufficient that

(34)    $VT + 2B^2T - \frac{1}{2}\inf_{\hat{y}} \big(\sqrt{S(T)} - 2B\sqrt{T}\big)^2 < g(T)$

for some sublinear $g(T)$. By the hypothesis of Theorem 9, we have $\inf_{\hat{y}} S(T) \ge
(8V + 16B^2)T$.

Then, $S(T) \ge \big(\sqrt{2VT + 4B^2T} + 2B\sqrt{T}\big)^2$, and (34) holds since $VT + 2B^2T -
\frac{1}{2}\inf_{\hat{y}}\big(\sqrt{S(T)} - 2B\sqrt{T}\big)^2 \le VT + 2B^2T - \frac{1}{2}\big(\sqrt{2VT + 4B^2T}\big)^2 = 0$.
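
The claimed lower bound on $S(T)$ can be checked with the elementary inequality $(a+b)^2 \le 2a^2 + 2b^2$. The following worked step uses only the symbols already defined above and assumes $V \ge 0$:

```latex
\begin{align*}
\left(\sqrt{2VT + 4B^2T} + 2B\sqrt{T}\right)^2
  &\le 2\,(2VT + 4B^2T) + 2\,\big(2B\sqrt{T}\big)^2 \\
  &= 4VT + 16B^2T \\
  &\le (8V + 16B^2)\,T \;\le\; S(T).
\end{align*}
```

Taking square roots then gives $\sqrt{S(T)} - 2B\sqrt{T} \ge \sqrt{2VT + 4B^2T} \ge 0$, which is what the final inequality in the proof uses.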


Appendix C. Proofs for Section [HI 
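C.1. Proof of Lemma 12. By the triangle inequality, we have

\[
\begin{aligned}
g_1 &= \frac{1}{w+1} \sum_{k=0}^{w} \sum_{\tau \in \Omega_k} \beta \, |x^*_{\tau-1} - x^{(k)}_{\tau-1}| \\
&\le \frac{\beta}{w+1} \sum_{k=0}^{w} \sum_{\tau \in \Omega_k} \left( |x^*_{\tau-1} - y_{\tau-1}| + |y_{\tau-1} - y_{\tau-1|\tau-w-2}| + |y_{\tau-1|\tau-w-2} - x^{(k)}_{\tau-1}| \right) .
\end{aligned}
\]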


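By the first order optimality condition, we have $x^*_{\tau-1} \in [y_{\tau-1} - 2\beta,\; y_{\tau-1} + 2\beta]$, and $x^{(k)}_{\tau-1} \in [y_{\tau-1|\tau-w-2} - \beta,\; y_{\tau-1|\tau-w-2} + \beta]$. Hence, by the prediction model,

\[
g_1 \le \frac{3\beta^2 T}{w+1} + \frac{\beta}{w+1} \sum_{k=0}^{w} \sum_{\tau \in \Omega_k} \left| \sum_{s = 1 \vee (\tau-w-2)}^{\tau-1} f(\tau-1-s)\, e(s) \right| .
\]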


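C.2. Proof of Lemma 13. Note that by Lemma 12, we have

\[
\begin{aligned}
E\, g_1'(e) &\le \frac{\beta}{w+1} \sum_{k=0}^{w} \sum_{\tau \in \Omega_k} E \left| \sum_{s = 1 \vee (\tau-w-2)}^{\tau-1} f(\tau-1-s)\, e(s) \right| \\
&\le \frac{\beta}{w+1} \sum_{k=0}^{w} \sum_{\tau \in \Omega_k} \|f_w\| = \frac{\beta T}{w+1} \|f_w\| ,
\end{aligned}
\]

where the second inequality is by Jensen's inequality and taking expectation.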

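C.3. Proof of Lemma 16. By definition of $g_2$ and unraveling the prediction model, we have

\[
\begin{aligned}
g_2 &= \frac{1}{w+1} \sum_{k=0}^{w} \sum_{\tau \in \Omega_k} \sum_{t=\tau}^{\tau+w} \frac{1}{2} (y_t - y_{t|\tau-1})^2 \\
&= \frac{1}{w+1} \sum_{k=0}^{w} \sum_{\tau \in \Omega_k} \sum_{t=\tau}^{\tau+w} \frac{1}{2} \left( \sum_{s=\tau}^{t} f(t-s)\, e(s) \right)^2 .
\end{aligned}
\]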

Writing it in matrix form, it is not hard to see that

\[ g_2 = \frac{1}{w+1} \sum_{k=0}^{w} \frac{1}{2} \|A_k e\|^2 , \]

where $A_k$ has the block diagonal structure given by

\[ A_k = \mathrm{diag}(A_k^1, A_k^2, \ldots, A_k^2, A_k^3) \tag{35} \]

and there are three types of submatrices in $A_k$ given by, for $i = 1, 2, 3$:

\[
A_k^i = \begin{pmatrix}
f(0) & 0 & \cdots & 0 \\
f(1) & f(0) & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
f(v_i) & f(v_i - 1) & \cdots & f(0)
\end{pmatrix},
\]

where $v_1 = k - 2$ if $k \ge 2$ and $v_1 = k + w - 1$ otherwise, $v_2 = w$, and $v_3 = (T - k + 1) \bmod (w+1)$. Note that in fact, the matrix $A_k^2$ is the same for all $k$. Hence, we have



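\[ g_2 = \frac{1}{2}\, e^T \left( \frac{1}{w+1} \sum_{k=0}^{w} A_k^T A_k \right) e = \frac{1}{2} \|A e\|^2 , \]

where we define $A$ to be such that $A^T A = \frac{1}{w+1} \sum_{k=0}^{w} A_k^T A_k$; this can be done because the right-hand side is positive semidefinite, since $A_k$ is lower triangular. The last equality is because all $A_k$ have the same structure. Let $\lambda$ be the maximum eigenvalue of $AA^T$, which can be expressed by

\[
\lambda = \max_{\|x\|=1} x^T A A^T x = \frac{1}{w+1} \max_{\|x\|=1} \sum_{k=0}^{w} x^T A_k A_k^T x \le \frac{1}{w+1} \sum_{k=0}^{w} \lambda_k ,
\]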

where $\lambda_k$ is the maximum eigenvalue of $A_k A_k^T$. Note that $A_k$ has a block diagonal structure, hence $A_k A_k^T$ also has a block diagonal structure, and if we divide the vector $x = (x_1, x_2, \ldots, x_m)$ into sub-vectors of appropriate dimension, then by the block diagonal nature of $A_k A_k^T$, we have

\[
x^T A_k A_k^T x = x_1^T A_k^1 A_k^{1T} x_1 + x_2^T A_k^2 A_k^{2T} x_2 + \cdots + x_{m-1}^T A_k^2 A_k^{2T} x_{m-1} + x_m^T A_k^3 A_k^{3T} x_m .
\]


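Hence, if we denote by $\lambda_k^i$ the maximum eigenvalue of the matrix $A_k^i A_k^{iT}$, then we have

\[
\begin{aligned}
\lambda_k &= \max_{x} \frac{x^T A_k A_k^T x}{x^T x} \\
&= \max_{x} \frac{x_1^T A_k^1 A_k^{1T} x_1 + x_2^T A_k^2 A_k^{2T} x_2 + \cdots + x_m^T A_k^3 A_k^{3T} x_m}{x_1^T x_1 + \cdots + x_m^T x_m} \\
&\le \max_{x} \frac{\lambda_k^1 x_1^T x_1 + \lambda_k^2 x_2^T x_2 + \cdots + \lambda_k^3 x_m^T x_m}{x_1^T x_1 + \cdots + x_m^T x_m} \\
&\le \max_{x} \frac{\max(\lambda_k^1, \lambda_k^2, \lambda_k^3) \cdot (x_1^T x_1 + \cdots + x_m^T x_m)}{x_1^T x_1 + \cdots + x_m^T x_m} = \max(\lambda_k^1, \lambda_k^2, \lambda_k^3),
\end{aligned}
\]

where $\lambda_k^i$ is the maximum eigenvalue of $A_k^i A_k^{iT}$ for $i \in \{1, 2, 3\}$. As the $A_k^i A_k^{iT}$ are all positive semidefinite, we can bound the maximum eigenvalue by the trace, and noting that $A_k^1$ and $A_k^3$ are submatrices of $A_k^2$, we have

\[ \lambda_k \le \max(\lambda_k^1, \lambda_k^2, \lambda_k^3) \le \mathrm{tr}(A_k^2 A_k^{2T}) = \frac{1}{\sigma^2} F(w) . \]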


Note on (35): $A_k^2$ is repeated $\left\lfloor \frac{T-k+1}{w+1} \right\rfloor$ times in $A_k$ for $k \ge 2$, and $\left\lfloor \frac{T-k-w}{w+1} \right\rfloor$ times otherwise.
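The eigenvalue argument above can be illustrated numerically. The sketch below is only an illustration under assumed values: the impulse response `f`, the window size `w`, and the block sizes are hypothetical, and the helper functions are not from the paper. It builds a block diagonal matrix from lower-triangular Toeplitz blocks and compares the largest eigenvalue of $A_k A_k^T$ with the block-wise maximum and with the trace of the largest block.

```python
import numpy as np

def toeplitz_block(f, v):
    """Lower-triangular Toeplitz block with entry f[i - j] for 0 <= j <= i <= v."""
    A = np.zeros((v + 1, v + 1))
    for i in range(v + 1):
        for j in range(i + 1):
            A[i, j] = f[i - j]
    return A

def block_diag(blocks):
    """Place square blocks along the diagonal of a zero matrix."""
    n = sum(b.shape[0] for b in blocks)
    out = np.zeros((n, n))
    pos = 0
    for b in blocks:
        size = b.shape[0]
        out[pos:pos + size, pos:pos + size] = b
        pos += size
    return out

w = 4
f = 0.8 ** np.arange(w + 1)             # hypothetical impulse response f(0), ..., f(w)
# Hypothetical block sizes: v_1 = 2, three copies with v_2 = w, and v_3 = 1.
blocks = [toeplitz_block(f, v) for v in (2, w, w, w, 1)]
A_k = block_diag(blocks)

lam_k = np.linalg.eigvalsh(A_k @ A_k.T).max()
lam_blocks = [np.linalg.eigvalsh(b @ b.T).max() for b in blocks]

A2 = toeplitz_block(f, w)
trace_bound = np.trace(A2 @ A2.T)
# The trace equals sum_{i=0}^{w} sum_{j=0}^{i} f(j)^2.
double_sum = sum(f[j] ** 2 for i in range(w + 1) for j in range(i + 1))
```

In this instance `lam_k` coincides with `max(lam_blocks)` and stays below `trace_bound`, which in turn matches the double sum of squared impulse-response coefficients that the text writes as $F(w)/\sigma^2$.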
C.4. Proof of Lemma 17. To prove the lemma, we use the following variant of the
Log-Sobolev inequality.


Lemma 20 (Theorem 3.2, [27]). Let $f : \mathbb{R}^n \to \mathbb{R}$ be convex, and let the random variable $X$ be supported on $[-d/2, d/2]^n$. Then

\[ E[\exp(f(X)) f(X)] - E[\exp(f(X))] \log E[\exp(f(X))] \le \frac{d^2}{2} E[\exp(f(X)) \|\nabla f(X)\|^2] . \]


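We will use Lemma 20 to prove Lemma 17. Denote the moment generating function of $f(X)$ by

\[ m(\theta) := E\, e^{\theta f(X)}, \quad \theta > 0 . \]

The function $\theta f : \mathbb{R}^n \to \mathbb{R}$ is convex, and therefore it follows from Lemma 20 that

\[ E[e^{\theta f} \theta f] - E[e^{\theta f}] \ln E[e^{\theta f}] \le \frac{d^2}{2} E[e^{\theta f} \|\theta \nabla f\|^2] , \]

\[ \theta m'(\theta) - m(\theta) \ln m(\theta) \le \frac{1}{2} \theta^2 d^2 E[e^{\theta f} \|\nabla f\|^2] . \]

By the self-bounding property $\|\nabla f\|^2 \le a f + b$,

\[ \theta m'(\theta) - m(\theta) \ln m(\theta) \le \frac{1}{2} \theta^2 d^2 E[e^{\theta f(X)} (a f(X) + b)] = \frac{1}{2} \theta^2 d^2 \left[ a\, m'(\theta) + b\, m(\theta) \right] . \]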


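Since $m(\theta) > 0$, dividing by $\theta^2 m(\theta)$ gives

\[ \frac{d}{d\theta} \left[ \left( \frac{1}{\theta} - \frac{a d^2}{2} \right) \ln m(\theta) \right] \le \frac{b d^2}{2} . \tag{36} \]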
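To see why the division yields (36), one can expand the derivative on the left-hand side. The following sketch uses only quantities already defined in the proof and is meant as a reading aid rather than as part of the original argument.

```latex
\begin{align*}
\frac{d}{d\theta}\Bigl[\Bigl(\frac{1}{\theta}-\frac{ad^2}{2}\Bigr)\ln m(\theta)\Bigr]
 &= -\frac{\ln m(\theta)}{\theta^2}
    + \Bigl(\frac{1}{\theta}-\frac{ad^2}{2}\Bigr)\frac{m'(\theta)}{m(\theta)} \\
 &= \frac{\theta m'(\theta) - m(\theta)\ln m(\theta)}{\theta^2 m(\theta)}
    - \frac{ad^2}{2}\,\frac{m'(\theta)}{m(\theta)}
 \;\le\; \frac{bd^2}{2},
\end{align*}
```

where the final inequality is the self-bounding estimate divided by $\theta^2 m(\theta)$, with the term $\frac{ad^2}{2}\frac{m'(\theta)}{m(\theta)}$ moved to the left.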



Since $m(0) = 1$ and $m'(0) = E f(X) = 0$, we have

\[ \lim_{\theta \to 0^+} \left( \frac{1}{\theta} - \frac{a d^2}{2} \right) \ln m(\theta) = 0 , \]

and therefore integrating both sides of (36) from $0$ to $s$ gives

\[ \left( \frac{1}{s} - \frac{a d^2}{2} \right) \ln m(s) \le \frac{1}{2} b d^2 s , \tag{37} \]

for $s > 0$. We can bound the tail probability $P\{f > t\}$ with the control (37) over
the moment generating function $m(s)$.

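In particular,

\[
\begin{aligned}
P\{f(X) > t\} = P\{e^{s f(X)} > e^{st}\} &\le e^{-st} E\left[ e^{s f(X)} \right] \\
&= \exp[-st + \ln m(s)] \\
&\le \exp\left[ -st + \frac{b d^2 s^2}{2 - a s d^2} \right] ,
\end{aligned}
\]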

for $s \in [0, 2/(ad^2)]$. Choose $s = t/(bd^2 + ad^2 t/2)$ to get

\[ P\{f(X) > t\} \le \exp\left( \frac{-t^2}{d^2 (2b + at)} \right) . \]
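The effect of this choice of $s$ can be checked numerically. The sketch below is an illustration with hypothetical constants `a`, `b`, `d`; the closed form it compares against, $-t^2/(d^2(2b+at))$, is obtained here by direct substitution of the stated $s$ into the exponent $-st + bd^2s^2/(2 - asd^2)$.

```python
import numpy as np

def exponent(s, t, a, b, d):
    """Exponent -s*t + b*d^2*s^2 / (2 - a*s*d^2), valid for 0 <= s < 2 / (a*d^2)."""
    return -s * t + b * d**2 * s**2 / (2 - a * s * d**2)

a, b, d = 0.3, 1.5, 2.0                 # hypothetical constants
ts = np.linspace(0.1, 20.0, 50)         # grid of thresholds t > 0

# The choice of s from the proof, evaluated on the grid of t values.
s_choice = ts / (b * d**2 + a * d**2 * ts / 2)

at_choice = exponent(s_choice, ts, a, b, d)
closed_form = -ts**2 / (d**2 * (2 * b + a * ts))
```

On this grid every `s_choice` lies strictly below $2/(ad^2)$, and `at_choice` agrees with `closed_form` up to floating-point error.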










